WO2022141706A1 - Speech recognition method, device and storage medium - Google Patents
Speech recognition method, device and storage medium
- Publication number
- WO2022141706A1 (PCT/CN2021/073773)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- feature vector
- vector sequence
- hot word
- sequence
- audio
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L2015/088—Word spotting
Definitions
- the present application relates to the technical field of speech recognition, and in particular, to a speech recognition method, device and storage medium.
- Embodiments of the present application provide a speech recognition method, device, and storage medium, which can improve the accuracy of hot word recognition.
- an embodiment of the present application provides a speech recognition method, the method comprising:
- a decoding operation is performed according to the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence to obtain a recognition result.
- an embodiment of the present application provides a speech recognition device, the speech recognition device including an audio encoder module, a hot word text encoder module, a hot word audio encoder module, a frame-level attention module, and a decoder module, wherein:
- the audio encoder module is used to encode the speech data to be recognized to obtain the first feature vector sequence
- the hot word text encoder module is used to encode each hot word in the preset hot word library to obtain a second feature vector sequence
- the hot word audio encoder module is used to encode the audio segment of each hot word in the preset hot word library to obtain a third feature vector sequence
- the frame-level attention module is configured to perform a first attention operation on the first feature vector sequence and the third feature vector sequence to obtain a fourth feature vector sequence;
- the decoder module is configured to perform a decoding operation according to the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence to obtain a recognition result.
- an embodiment of the present application provides an electronic device, including a processor, a memory, a communication interface, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the processor, and the programs include instructions for executing the steps in any method of the first aspect of the embodiments of the present application.
- an embodiment of the present application provides a computer-readable storage medium, wherein the computer-readable storage medium stores a computer program for electronic data exchange, and the computer program causes a computer to execute the method described in the first aspect of the embodiments of the present application.
- an embodiment of the present application provides a computer program product, wherein the computer program product includes a non-transitory computer-readable storage medium storing a computer program, and the computer program is operable to cause a computer to execute some or all of the steps described in the first aspect of the embodiments of the present application.
- the computer program product may be a software installation package.
- the speech recognition method, device and related products described in the embodiments of the present application encode the speech data to be recognized to obtain the first feature vector sequence; encode each hot word in the preset hot word library to obtain the second feature vector sequence; encode the audio segment of each hot word in the preset hot word library to obtain the third feature vector sequence; perform the first attention operation on the first feature vector sequence and the third feature vector sequence to obtain the fourth feature vector sequence; and perform the decoding operation according to the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence to obtain the recognition result. Because both the hot word text information and the corresponding audio segment are used as input, the accuracy of hot word recognition can be improved.
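The data flow summarized above can be sketched as follows. This is only an illustrative outline: the function name `recognize` and the five callables passed to it are hypothetical placeholders, not components defined by the application, and any real encoder, attention module or decoder would replace the stubs.

```python
from typing import Callable, List, Sequence

Vector = List[float]

def recognize(
    speech: Sequence[float],
    hot_words: Sequence[str],
    hot_word_clips: Sequence[Sequence[float]],
    *,
    audio_encoder: Callable[[Sequence[float]], List[Vector]],
    text_encoder: Callable[[str], Vector],
    hot_word_audio_encoder: Callable[[Sequence[float]], Vector],
    frame_attention: Callable[[List[Vector], List[Vector]], List[Vector]],
    decoder: Callable[[List[Vector], List[Vector], List[Vector]], object],
):
    """Wire the five modules together (all callables are hypothetical stubs)."""
    first = audio_encoder(speech)                                # first feature vector sequence
    second = [text_encoder(w) for w in hot_words]                # second: one vector per hot word
    third = [hot_word_audio_encoder(c) for c in hot_word_clips]  # third: one vector per audio clip
    fourth = frame_attention(first, third)                       # fourth: hot-word-aware audio
    return decoder(second, third, fourth)                        # recognition result

# Toy stubs, only to show that the shapes line up.
result = recognize(
    [0.1, 0.2, 0.3],
    ["dong jing"],
    [[1.0, 2.0]],
    audio_encoder=lambda s: [[x] for x in s],
    text_encoder=lambda w: [float(len(w))],
    hot_word_audio_encoder=lambda c: [sum(c)],
    frame_attention=lambda first, third: [f + third[0] for f in first],
    decoder=lambda second, third, fourth: (len(second), len(third), len(fourth)),
)
```

Passing the modules in as callables keeps the sketch independent of any particular network architecture.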
- FIG. 1A is a schematic structural diagram of a speech recognition model provided by an embodiment of the present application.
- FIG. 1B is a schematic flowchart of a speech recognition method provided by an embodiment of the present application.
- FIG. 1C is a schematic diagram of a demonstration of hot word encoding provided by an embodiment of the present application.
- FIG. 1D is a schematic diagram of a demonstration of feature splicing provided by an embodiment of the present application.
- FIG. 2 is a schematic flowchart of another speech recognition method provided by an embodiment of the present application.
- FIG. 3 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
- FIG. 4A is a block diagram of functional units of a speech recognition device provided by an embodiment of the present application.
- FIG. 4B is a block diagram of functional units of another speech recognition apparatus provided by an embodiment of the present application.
- the electronic devices involved in the embodiments of the present application may include various handheld devices with speech recognition functions, voice recorders, smart robots, smart readers, smart translators, smart headphones, smart dictionaries, smart point readers, in-vehicle devices, wearable devices, computing devices or other processing devices connected to wireless modems, as well as various forms of user equipment (UE), mobile stations (MS), terminal devices, etc. The electronic device may also be a server or a smart home device.
- the smart home device may be at least one of the following: refrigerators, washing machines, rice cookers, smart curtains, smart lights, smart beds, smart trash cans, microwave ovens, ovens, steamers, air conditioners, range hoods, servers, smart doors, smart windows, window and door wardrobes, smart speakers, smart homes, smart chairs, smart drying racks, smart showers, water dispensers, water purifiers, air purifiers, doorbells, monitoring systems, smart garages, TVs, projectors, smart dining tables, smart sofas, massage chairs, treadmills, etc.
- FIG. 1A shows a speech recognition model provided by an embodiment of the present application.
- the speech recognition model includes an audio encoder module, a hot word text encoder module, a hot word audio encoder module, a frame-level attention module, a word-level attention module and a decoder module. The decoder module may include a decoder, and the speech recognition model can be used to implement the speech recognition function, as follows:
- the electronic device can also use the hot word text encoder module to independently encode each hot word in the preset hot word database, so as to encode the hot words of different lengths into fixed-dimensional vectors, and obtain a set of features representing the hot words.
- the preset hot word library can be set in advance according to the user's needs. For example, hot words matching the user's identity or occupation can be selected from a basic hot word library that is established in advance for different identities or occupations.
- the preset hot word library can also be generated automatically from the user's usage history. For example, hot words that appear while the user uses the device can be collected automatically into the preset hot word library. As another example, in a voice assistant scenario, the names in the user's address book can be read as hot words and used to generate the preset hot word library. As a further example, in an input method, after legal authorization, entity texts that the user types in pinyin, such as place names and personal names, can be remembered as hot words and used to generate the preset hot word library.
- the preset hot word library can be saved locally or in the cloud.
- the audio segment of each hot word in the preset hot word library can be independently encoded by the hot word audio encoder module. The hot word audio encoder module can be shared with the aforementioned audio encoder module, that is, the two can be the same encoder. Hot word audio clips of different lengths can therefore also be encoded into fixed-dimensional vectors: either the encoder output of the last frame of the hot word audio clip is used, or the average of the encoder outputs over all frames of the clip is used.
- the frame-level attention module performs a frame-level attention operation on the audio encoding representation of each frame (the first feature vector sequence) and the hot word audio encoding representation (the third feature vector sequence), fusing the hot word information to form a new audio encoding representation, namely the fourth feature vector sequence.
- two ways can be used to perform the decoding operation, as follows:
- in the first way, the word-level attention module takes as input the state vector $d_t$ output by the decoder module at time t, the fourth feature vector sequence output by the frame-level attention module, the second feature vector sequence $H^z$ output by the hot word text encoder module and the third feature vector sequence $H^w$ output by the hot word audio encoder module, and uses the attention mechanism to calculate the audio context feature vector $c_t^x$, the hot word audio context feature vector $c_t^w$ and the hot word text context feature vector $c_t^z$ for predicting the t-th character. These three vectors are input into the decoder module to complete decoding.
- in the second way, the decoder module can directly take as input the fourth feature vector sequence output by the frame-level attention module, the second feature vector sequence $H^z$ output by the hot word text encoder module and the third feature vector sequence $H^w$ output by the hot word audio encoder module to complete decoding.
- in this way, the hot word excitation effect can be significantly improved, and decoding the three sequences together improves hot word recognition and therefore hot word recognition accuracy.
- FIG. 1B is a schematic flowchart of a speech recognition method provided by an embodiment of the present application. As shown in the figure, the speech recognition method shown in FIG. 1B is applied to the speech recognition model shown in FIG. 1A, the speech recognition model is applied to an electronic device, and the speech recognition method includes:
- the speech data to be recognized may be pre-stored or collected in real time, and may be voice data or a sequence of voice feature vectors. The voice data may be at least one of the following: recorded data, data recorded in real time, recordings extracted from video data, synthesized recordings, etc., which are not limited here.
- the speech feature vector sequence may be at least one of the following: Filter Bank feature, Mel Frequency Cepstrum Coefficient (MFCC) feature, Perceptual Linear Predictive (PLP) feature, etc., which are not limited here.
- the electronic device can perform feature extraction on the voice data to obtain a sequence of voice feature vectors, and then encode the sequence of voice feature vectors to obtain a first sequence of feature vectors.
- the electronic device can directly encode the sequence of speech feature vectors to obtain a first sequence of feature vectors.
- the electronic device can encode the speech data to be recognized through an audio encoder module to obtain a first feature vector sequence
- the audio encoder module can be composed of one or more encoding layers. An encoding layer can be a long short-term memory (LSTM) layer of an LSTM neural network or a convolutional layer of a convolutional neural network, and the LSTM layer can be unidirectional or bidirectional.
- the preset hot word database may be stored in the electronic device in advance, and the preset hot word database may include text information of multiple hot words.
- the electronic device can encode each hot word in the preset hot word database through the hot word text encoder module to obtain the second feature vector sequence.
- the preset hot word library may also be pre-stored on other servers and obtained by accessing those servers.
- the number of characters contained in different hot words can be the same or different. For example, the Japanese place name "Tokyo" has two characters and "Kanagawa" has three characters. Such variable-length input can be represented by a fixed-dimensional vector to facilitate model processing.
- the function of the hot word text encoder module is to encode hot words of different lengths into fixed-dimensional vectors. It can include one or more encoding layers, each of which can be a long short-term memory (LSTM) layer of an LSTM neural network or a convolutional layer of a convolutional neural network, and the LSTM layer can be unidirectional or bidirectional.
- the bidirectional LSTM layer encodes hot words better than the unidirectional LSTM layer.
- for example, for the three-character hot word "Kanagawa" (whose second and third characters are "Nai" and "Chuan"), the encoding performed by a hot word encoder with one bidirectional LSTM layer is shown in FIG. 1C. The left side of the figure is the forward part of the bidirectional LSTM layer and the right side is the reverse part. Splicing the output vectors $h_f^z$ and $h_b^z$ of the last forward step and the last reverse step gives the vector $h^z$, which is the encoding vector representation of the hot word. The encoding vector representations of multiple hot words form the second feature vector sequence.
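The splicing of the last forward and last reverse outputs can be illustrated with a minimal sketch. To keep it short and dependency-free, a toy elementwise tanh recurrence is used in place of an LSTM layer; the cell, the `decay` parameter and the character embeddings are all assumptions made for illustration, not part of the application.

```python
import math
from typing import List

Vector = List[float]

def run_recurrent(inputs: List[Vector], decay: float = 0.5) -> Vector:
    """Run a toy elementwise tanh recurrence and return the last hidden state.

    This cell is a simplified stand-in for an LSTM layer (an assumption).
    """
    hidden = [0.0] * len(inputs[0])
    for x in inputs:
        hidden = [math.tanh(xi + decay * hi) for xi, hi in zip(x, hidden)]
    return hidden

def encode_hot_word(char_embeddings: List[Vector]) -> Vector:
    """Splice the last forward state and the last reverse state."""
    h_f = run_recurrent(char_embeddings)        # forward pass, first to last character
    h_b = run_recurrent(char_embeddings[::-1])  # reverse pass, last to first character
    return h_f + h_b                            # fixed length: 2 * embedding dimension

# Hypothetical 2-dimensional character embeddings for a 2- and a 3-character hot word.
two_chars = [[0.1, 0.2], [0.3, 0.4]]
three_chars = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
second_sequence = [encode_hot_word(w) for w in (two_chars, three_chars)]
```

Both hot words end up as vectors of the same length regardless of how many characters they contain, which is the property the text encoder is meant to provide.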
- the electronic device can encode the audio segment of each hot word in the preset hot word database through the hot word audio encoder module to obtain the third feature vector sequence.
- the third feature vector sequence can represent the audio information contained in the hot word audio segment.
- the hot word audio encoder module and the above audio encoder module can be shared, that is, the two can share algorithms, for example by being the same encoder. The hot word audio encoder module can also include one or more encoding layers, each of which can be a long short-term memory (LSTM) layer of an LSTM neural network or a convolutional layer of a convolutional neural network, and the LSTM layer can be unidirectional or bidirectional.
- the hot word audio encoder module and the above-mentioned audio encoder module may also be two independent encoders, which are not limited in the present invention.
- the audio clips of hot words can be obtained in ways that include, but are not limited to, cutting them from existing audio, collecting them manually, or synthesizing them with a speech synthesis system.
- the audio segment of a hot word may be pre-stored, or may be an audio segment synthesized from the hot word.
- the last frame can represent the information of the entire audio sequence.
- the last frame may not be taken, for example, the average of all frames may be taken.
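The two pooling options mentioned here (last frame, or average over all frames) can be sketched as below. The frame vectors are made-up values standing in for the per-frame outputs of the hot word audio encoder, and the function names are hypothetical.

```python
from typing import List

Vector = List[float]

def pool_last_frame(frames: List[Vector]) -> Vector:
    """Use the encoder output of the last frame as the clip representation."""
    if not frames:
        raise ValueError("clip has no frames")
    return list(frames[-1])

def pool_mean(frames: List[Vector]) -> Vector:
    """Average the encoder outputs over all frames of the clip."""
    if not frames:
        raise ValueError("clip has no frames")
    dim = len(frames[0])
    return [sum(f[i] for f in frames) / len(frames) for i in range(dim)]

# Two hot word clips of different lengths (hypothetical encoder outputs).
short_clip = [[1.0, 2.0], [3.0, 4.0]]
long_clip = [[0.0, 0.0], [2.0, 2.0], [4.0, 7.0]]

third_sequence_last = [pool_last_frame(c) for c in (short_clip, long_clip)]
third_sequence_mean = [pool_mean(c) for c in (short_clip, long_clip)]
```

Either choice maps clips of any length to one vector of the encoder's output dimension.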
- $z_0$ is a special hot word "<no-bias>", which means that there is no hot word. In a specific implementation, it can be replaced by the average of all hot word vectors.
- the hot word vectors can be all vectors of at least one of the second feature vector sequence and the third feature vector sequence.
- "<no-bias>" is introduced to handle cases where no hot word exists in the speech or the speech segment being recognized is not a hot word.
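A minimal sketch of building the "no hot word" entry as the average of the hot word vectors and placing it at index 0 is given below. The function name `with_no_bias` and the choice to prepend the entry are assumptions for illustration.

```python
from typing import List

Vector = List[float]

def with_no_bias(hot_word_vectors: List[Vector]) -> List[Vector]:
    """Prepend z_0, computed as the mean of all hot word vectors."""
    if not hot_word_vectors:
        raise ValueError("at least one hot word vector is required")
    count = len(hot_word_vectors)
    dim = len(hot_word_vectors[0])
    z0 = [sum(v[i] for v in hot_word_vectors) / count for i in range(dim)]
    return [z0] + [list(v) for v in hot_word_vectors]

# Hypothetical hot word vectors.
sequence = with_no_bias([[1.0, 0.0], [3.0, 4.0]])
```

With this entry present, an attention operation always has a candidate to select when none of the real hot words matches.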
- the electronic device can perform the first attention operation on the first feature vector sequence and the third feature vector sequence through the frame-level attention module, fusing the features of the two to obtain the fourth feature vector sequence, which can significantly improve the hot word excitation effect.
- the function of the frame-level attention module is to fuse, into the output of the audio encoder module for each frame, the text information of the hot words in the preset hot word library, forming a fourth feature vector sequence that carries a representation of the hot word information, so that the audio representation of each frame of the speech data to be recognized (the first feature vector sequence) is more robust to hot words.
- a first attention operation is performed on the first feature vector sequence and the third feature vector sequence to obtain a fourth feature vector sequence, including:
- the electronic device can calculate, based on the attention mechanism, a matching coefficient between the query item and each feature vector in the third feature vector sequence. For example, the matching coefficients can be computed by an inner product or a feature distance and then normalized to obtain the matching coefficient $w_n$, which is then combined with the corresponding feature vector $h_n^z$.
- the combination can be any of the following: point multiplication and summation, a weighted operation, an inner product, etc., which are not limited here. After the operation, a new feature vector $h_i^z$ is obtained, which is the feature that best matches the query item.
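The matching-coefficient computation can be sketched as follows, using the inner product for matching, softmax for normalization, and a weighted sum for the combination. Two points are assumptions made for illustration: each frame of the first feature vector sequence is used as the query and has the same dimension as the hot word vectors, and the matched vector is spliced (concatenated) onto the frame to form the fourth sequence.

```python
import math
from typing import List, Tuple

Vector = List[float]

def softmax(scores: List[float]) -> List[float]:
    """Normalize scores so that they are positive and sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query: Vector, keys: List[Vector]) -> Tuple[Vector, List[float]]:
    """Return the weighted sum of keys and the matching coefficients."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]  # inner products
    weights = softmax(scores)                                          # matching coefficients
    dim = len(keys[0])
    context = [sum(w * key[i] for w, key in zip(weights, keys)) for i in range(dim)]
    return context, weights

def frame_level_attention(first_seq: List[Vector], third_seq: List[Vector]) -> List[Vector]:
    """Fuse hot word audio information into every frame (splicing is an assumption)."""
    fourth_seq = []
    for frame in first_seq:
        context, _ = attend(frame, third_seq)
        fourth_seq.append(list(frame) + context)
    return fourth_seq

# Hypothetical values: one audio frame and two identical hot word audio vectors.
fourth = frame_level_attention([[1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]])
_, coeffs = attend([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

A feature-distance score could be substituted for the inner product without changing the rest of the sketch.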
- the frame-level attention module adds content containing the audio information of the hot words in the preset hot word library during audio encoding, so it is more conducive to accurate decoding of hot words by the subsequent decoder module.
- the electronic device may input the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence to the decoder for the decoding operation to obtain the recognition result; alternatively, the electronic device may first perform the second attention operation on the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence, and then input the result to the decoder for the decoding operation to obtain the recognition result.
- a decoder can contain multiple neural network layers.
- the decoding operation can use beam search decoding or other decoding methods, which will not be repeated here.
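As a rough illustration of beam search decoding, the sketch below keeps the `beam_size` best partial hypotheses at each step. The decoder is replaced by a hypothetical lookup table `toy_step` that returns log-probabilities for the next token; in a real system these scores would come from the decoder module.

```python
import math
from typing import Callable, Dict, List, Tuple

Hypothesis = Tuple[List[str], float]  # (tokens, accumulated log-probability)

def beam_search(
    step_fn: Callable[[List[str]], Dict[str, float]],
    beam_size: int,
    max_len: int,
    sos: str = "<s>",
    eos: str = "</s>",
) -> Hypothesis:
    """Return the highest-scoring hypothesis found by beam search."""
    beams: List[Hypothesis] = [([sos], 0.0)]
    finished: List[Hypothesis] = []
    for _ in range(max_len):
        candidates: List[Hypothesis] = []
        for prefix, score in beams:
            for token, logp in step_fn(prefix).items():
                candidates.append((prefix + [token], score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            # Hypotheses that emitted the end symbol stop growing.
            (finished if prefix[-1] == eos else beams).append((prefix, score))
        if not beams:
            break
    finished.extend(beams)  # keep unfinished hypotheses if max_len was reached
    return max(finished, key=lambda c: c[1])

def toy_step(prefix: List[str]) -> Dict[str, float]:
    """Hypothetical next-token log-probabilities keyed on the last token."""
    table = {
        "<s>": {"dong": math.log(0.6), "nan": math.log(0.4)},
        "dong": {"jing": math.log(0.9), "</s>": math.log(0.1)},
        "nan": {"jing": math.log(0.5), "</s>": math.log(0.5)},
        "jing": {"</s>": math.log(1.0)},
    }
    return table[prefix[-1]]

best_tokens, best_score = beam_search(toy_step, beam_size=2, max_len=4)
```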
- a large amount of voice data with text annotations can be collected, and its voice features can be extracted, and the voice features can be at least one of the following: PLP, MFCC, FilterBank, etc., which are not limited here.
- the text-annotated speech data collected here can be used to train the Hot Word Audio Encoder module.
- the speech feature sequence and text annotation sequence of a certain speech data can be expressed as follows:
- Speech feature sequence $X = [x_1, x_2, \ldots, x_k, \ldots, x_K]$
- Text annotation sequence $Y = [y_0, y_1, \ldots, y_t, \ldots, y_T]$
- $x_k$ represents the k-th frame speech feature vector in the speech feature sequence $X$
- $K$ is the total number of speech frames
- $y_t$ represents the t-th character in the text annotation sequence $Y$
- $T+1$ is the total number of characters in the text annotation
- $y_0$ is the sentence start symbol "<s>"
- $y_T$ is the sentence end symbol "</s>".
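The indexing convention of the annotation sequence can be checked with a few lines. The example sentence is a made-up pinyin sequence used only for illustration.

```python
from typing import List

def build_annotation(chars: List[str]) -> List[str]:
    """Wrap the characters with the sentence start and end symbols."""
    return ["<s>"] + list(chars) + ["</s>"]

# Hypothetical sentence, written as one pinyin syllable per character.
Y = build_annotation(["huan", "ying", "lai", "dao", "dong", "jing"])
T = len(Y) - 1  # index of the last symbol, so Y holds T + 1 symbols in total
```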
- the speech recognition model can have the ability to support arbitrary hot word recognition, which means that hot words cannot be limited in model training. Therefore, in this embodiment of the present application, annotated segments can be randomly selected as hot words from the text annotations of the training data to participate in the entire model training.
- B is an integer greater than 1, and will be described in detail.
- P and N can be set, where P is the probability that a hot word is selected from a given training sentence, and N is the maximum number of characters of the selected hot word.
- a comparison of the annotations before and after hot words are selected from a sentence is shown in the following table:
- the special tag "<bias>" can be added after the selected hot word.
- the role of "<bias>" is to introduce training errors that force the model parameters related to hot words to be updated during model training, such as the model parameters of the hot word audio encoder module or of the hot word text encoder module.
- when "dong" and "jing" are selected as hot words, they can be added to the hot word list of the current model update as input to the hot word audio encoder module or the hot word text encoder module.
- the hot word selection work is performed independently for each model update, and the hot word list can be empty at the initial moment. After processing the data, the model parameters can be updated using neural network optimization methods.
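The random selection of training hot words can be sketched as below. The text only fixes the probability P and the maximum length N; drawing the length and the start position uniformly, and the example sentence itself, are assumptions made for illustration.

```python
import random
from typing import List, Optional, Tuple

def select_hot_word(
    chars: List[str], p: float, n_max: int, rng: random.Random
) -> Tuple[List[str], Optional[Tuple[str, ...]]]:
    """With probability p, pick a span of at most n_max characters as a hot word.

    Returns the (possibly re-labeled) annotation and the selected hot word,
    or None when no hot word is selected for this sentence.
    """
    if not chars or rng.random() >= p:
        return list(chars), None
    length = rng.randint(1, min(n_max, len(chars)))  # uniform length (assumption)
    start = rng.randint(0, len(chars) - length)      # uniform position (assumption)
    end = start + length
    hot_word = tuple(chars[start:end])
    # Insert the special tag right after the selected span.
    labeled = list(chars[:end]) + ["<bias>"] + list(chars[end:])
    return labeled, hot_word

sentence = ["huan", "ying", "lai", "dao", "dong", "jing"]  # hypothetical annotation
rng = random.Random(0)
labeled, hot_word = select_hot_word(sentence, p=1.0, n_max=2, rng=rng)
unchanged, none_selected = select_hot_word(sentence, p=0.0, n_max=2, rng=rng)
```

The selected hot words of all sentences in one model update would then be collected into the hot word list for that update.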
- during training, the sample data and the real recognition result corresponding to the sample data are obtained. The sample data can be encoded to obtain the first feature vector sequence; each hot word in the preset hot word library is encoded to obtain the second feature vector sequence; the audio segment of each hot word in the preset hot word library is encoded to obtain the third feature vector sequence; the first attention operation is performed on the first feature vector sequence and the third feature vector sequence to obtain the fourth feature vector sequence; and the decoding operation is performed according to the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence to obtain the predicted recognition result. The model parameters are then updated according to the deviation between the real recognition result and the predicted recognition result.
- the decoding operation is performed according to the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence to obtain a recognition result, which may include the following steps:
- A51. Perform a second attention operation on the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence to obtain a hot word text context feature vector sequence, a hot word audio context feature vector sequence and an audio context feature vector sequence;
- A52. Input the hot word text context feature vector sequence, the hot word audio context feature vector sequence, and the audio context feature vector sequence to a decoder to perform a decoding operation to obtain a recognition result.
- the electronic device can perform the second attention operation on the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence to obtain the hot word text context feature vector sequence, the hot word audio context feature vector sequence and the audio context feature vector sequence. The function of the word-level attention module is to extract, at each decoding moment, the audio-related feature vector and the hot-word-related feature vectors required at that moment from the audio feature vector sequence, the hot word text feature vector sequence and the hot word audio feature vector sequence.
- the audio-related feature vector represents the audio content of the character to be decoded at the t-th moment, the hot word text-related feature vector represents the possible hot word text content at the t-th moment, and the hot word audio-related feature vector represents the possible hot word audio content at the t-th moment.
- the attention mechanism can use a vector as the query item (query), perform the attention operation on a feature vector sequence, and select the feature vector that best matches the query item as the output. Specifically, a matching coefficient is calculated between the query item and each feature vector in the sequence, and these matching coefficients are then multiplied by the corresponding feature vectors and summed to obtain a new feature vector, which is the feature vector that best matches the query item.
- the second attention operation is performed on the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence to obtain a hot word text context feature vector sequence, a hot word audio context feature vector sequence and an audio context feature vector sequence, which can include the following steps:
- A511. Obtain the first state feature vector of the decoder at the first historical moment;
- A512. Perform an attention operation on the second feature vector sequence according to the first state feature vector to obtain the hot word text context feature vector sequence at the current moment;
- A513. Perform an attention operation on the third feature vector sequence according to the first state feature vector to obtain the hot word audio context feature vector sequence at the current moment;
- A514. Perform an attention operation on the fourth feature vector sequence according to the first state feature vector to obtain the audio context feature vector sequence at the current moment.
- the audio context feature vector sequence $c_t^x$ can be obtained. Since the hot words participate in the calculation, it contains the complete audio information of the potential hot words, so the $c_t^x$ calculated in this way also indicates whether a hot word is present and which hot word it is.
- using $d_t$ as the query item, the attention operation is performed on the second feature vector sequence $H^z$ output by the hot word text encoder module to obtain the hot word text context feature vector sequence $c_t^z$; in the same way, using $d_t$ as the query item, the attention operation is performed on the third feature vector sequence $H^w$ output by the hot word audio encoder module to obtain the hot word audio context feature vector sequence $c_t^w$.
- these three vectors can be spliced together and sent to the decoder module for decoding at the t-th moment.
- $c_t^w$, which carries the audio information of the hot words in the library, is more conducive to accurate decoding of the subsequent hot words.
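The word-level attention step can be sketched as below: the decoder state $d_t$ queries each of the three sequences, and the three context vectors are spliced. Inner-product scoring with softmax normalization and equal dimensions for all vectors are assumptions; a trained model would typically use learned projections instead.

```python
import math
from typing import List

Vector = List[float]

def attend(query: Vector, sequence: List[Vector]) -> Vector:
    """Weighted sum of the sequence, weighted by softmax of inner products."""
    scores = [sum(q * s for q, s in zip(query, vec)) for vec in sequence]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(sequence[0])
    return [sum(w * vec[i] for w, vec in zip(weights, sequence)) for i in range(dim)]

def word_level_attention(
    d_t: Vector, fourth_seq: List[Vector], h_w: List[Vector], h_z: List[Vector]
) -> Vector:
    """Compute c_t^x, c_t^w and c_t^z with d_t as the query and splice them."""
    c_x = attend(d_t, fourth_seq)  # audio context
    c_w = attend(d_t, h_w)         # hot word audio context
    c_z = attend(d_t, h_z)         # hot word text context
    return c_x + c_w + c_z         # spliced decoder input for moment t

# Hypothetical values with a single vector per sequence.
spliced = word_level_attention(
    [1.0, 0.0], [[0.5, 0.5]], [[3.0, 4.0]], [[1.0, 2.0]]
)
balanced = attend([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```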
- the second attention operation may also be performed on the second feature vector sequence, the third feature vector sequence, and the fourth feature vector sequence based on the first feature vector sequence, so as to obtain the hot word text context feature vector sequence, the hot word audio context feature vector sequence and the audio context feature vector sequence.
- the decoder includes a first unidirectional long short-term memory (LSTM) layer, and the above step A511, obtaining the first state feature vector of the decoder at the first historical moment, may include the following steps:
- A5111. Obtain the recognition result of the first historical moment and the hot word text context feature vector sequence, the hot word audio context feature vector sequence and the audio context feature vector sequence of the first historical moment;
- A5112. Input the recognition result of the first historical moment and the hot word text context feature vector sequence, the hot word audio context feature vector sequence and the audio context feature vector sequence of the first historical moment into the first unidirectional LSTM layer to obtain the first state feature vector.
- the above-mentioned first historical moment is at least one moment before the current moment, that is, the first historical moment may be one moment before the current moment, or may be a plurality of moments before the current moment;
- the above-mentioned decoder may include two unidirectional LSTM layers, namely a first unidirectional LSTM layer and a second unidirectional LSTM layer.
- the recognition result of the decoder at the first historical moment and the hot word text context feature vector sequence, hot word audio context feature vector sequence and audio context feature vector sequence of the first historical moment are input to the first unidirectional LSTM layer to obtain the first state feature vector. Fusing the recognition result of the first historical moment with the corresponding input content in this way helps to improve the prediction ability of the model.
- the hot word text context feature vector sequence, hot word audio context feature vector sequence and audio context feature vector sequence at the first historical moment can be obtained as follows: obtain the first state feature vector of the decoder at the first historical moment; perform an attention operation on the second feature vector sequence according to the first state feature vector to obtain the hot word text context feature vector sequence at the first historical moment; perform an attention operation on the third feature vector sequence according to the first state feature vector to obtain the hot word audio context feature vector sequence at the first historical moment; and perform an attention operation on the fourth feature vector sequence according to the first state feature vector to obtain the audio context feature vector sequence at the first historical moment.
- this $d_{t-1}$ can be used as the query for the attention operation on the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence input at the first historical moment; $d_{t-1}$ itself can be obtained by inputting the recognition result of the second historical moment, together with the hot word text context feature vector sequence, the hot word audio context feature vector sequence and the audio context feature vector sequence of the second historical moment, to the first unidirectional long short-term memory layer, which yields the first state feature vector.
- the second historical moment may be at least one moment before the first historical moment, that is, the second historical moment may be a moment before the first historical moment, or may be multiple moments before the first historical moment.
- alternatively, all or part of the recognition results before the current moment, together with the hot word text context feature vector sequence, the hot word audio context feature vector sequence and the audio context feature vector sequence input to the decoder at the first historical moment, can be input into the first unidirectional long short-term memory layer to obtain the first state feature vector; then the hot word text context feature vector sequence, the hot word audio context feature vector sequence and the audio context feature vector sequence input to the decoder at the current moment are input to the second unidirectional long short-term memory layer to obtain the recognition result at the current moment.
- the decoder includes a second unidirectional long short-term memory layer, and the above step A52, in which the hot word text context feature vector sequence, the hot word audio context feature vector sequence and the audio context feature vector sequence are input to the decoder for a decoding operation to obtain the recognition result, may include the following steps:
- the hot word text context feature vector sequence, the hot word audio context feature vector sequence and the audio context feature vector sequence input into the decoder at the current moment are input into the second unidirectional long short-term memory layer to obtain the recognition result of the current moment, where the hot word text context feature vector sequence, the hot word audio context feature vector sequence and the audio context feature vector sequence of the current moment are obtained by performing the second attention operation, using the first state feature vector, on the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence of the current moment, respectively.
- the current moment can be understood as the current decoding moment.
- the first historical moment is the moment before the current moment; that is, the moment of decoding the (t-1)-th character is the first historical moment.
- the decoder can include two unidirectional long short-term memory layers. Taking the t-th character (moment) as an example, when decoding the t-th character, the first long short-term memory layer uses the recognition result character $y_{t-1}$ of moment t-1 to compute the state feature vector $d_t$. $d_t$ is then input to the word-level attention module, which calculates its output $c_t$ at the t-th moment; $c_t$ comprises the hot word text context feature vector sequence, the hot word audio context feature vector sequence and the audio context feature vector sequence. Next, $c_t$ is used as the input of the second long short-term memory layer to calculate the decoder output $h_t^d$. Finally, the posterior probability of the output character is calculated, and the recognition result is obtained.
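One decoding step of the two-layer decoder described above can be sketched in plain Python as follows. This is only an illustrative sketch under several assumptions that the text does not fix: dot-product attention is used for the word-level attention, each of the three sequences is collapsed into a single context vector per step, the long short-term memory cell omits biases, and the previous context c_{t-1} is fed to the first layer alongside the previous character. All names (`LSTMCell`, `attend`, `decode_step`, `out_proj`) are hypothetical.

```python
import math
import random


def softmax(scores):
    """Normalise raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def attend(query, sequence):
    """Dot-product attention: weighted sum of `sequence`, queried by `query`."""
    weights = softmax([sum(q * v for q, v in zip(query, vec)) for vec in sequence])
    return [sum(w * vec[i] for w, vec in zip(weights, sequence))
            for i in range(len(sequence[0]))]


class LSTMCell:
    """Simplified unidirectional LSTM cell (biases omitted for brevity)."""

    def __init__(self, input_dim, hidden_dim, rng):
        # One weight matrix per gate, applied to the concatenation [x; h].
        self.gates = [rand_matrix(hidden_dim, input_dim + hidden_dim, rng)
                      for _ in range(4)]

    def step(self, x, h, c):
        xh = x + h
        i = [sigmoid(z) for z in matvec(self.gates[0], xh)]    # input gate
        f = [sigmoid(z) for z in matvec(self.gates[1], xh)]    # forget gate
        g = [math.tanh(z) for z in matvec(self.gates[2], xh)]  # candidate
        o = [sigmoid(z) for z in matvec(self.gates[3], xh)]    # output gate
        c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
        h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
        return h_new, c_new


def decode_step(y_prev, c_prev, memories, layer1, layer2, out_proj, s1, s2):
    """One decoding step t; returns (posterior, c_t, new s1, new s2)."""
    # Layer 1: previous character y_{t-1} and previous context c_{t-1} -> d_t.
    d_t, cell1 = layer1.step(y_prev + c_prev, *s1)
    # Word-level attention: d_t queries the second, third and fourth sequences.
    c_t = [x for seq in memories for x in attend(d_t, seq)]
    # Layer 2: c_t -> decoder output h_t^d.
    h_t, cell2 = layer2.step(c_t, *s2)
    # Posterior probability over output characters.
    return softmax(matvec(out_proj, h_t)), c_t, (d_t, cell1), (h_t, cell2)
```

In this sketch the memory vectors must have the same dimension as the layer-1 hidden state so that the dot product is defined; a learned projection could relax that restriction.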
- the input of hot word text together with voice fragments increases the richness of the hot word input information, which improves the hot word excitation effect;
- the use of double-layer excitation, that is, a second attention operation, further improves the hot word excitation effect.
- the two together improve the hot word recognition effect, which in turn helps to improve the hot word recognition accuracy.
- the decoding operation is performed according to the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence to obtain a recognition result, which can be implemented as follows:
- the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence are input to a decoder for decoding operation to obtain the recognition result.
- the electronic device can directly input the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence to the decoder for a decoding operation to obtain the recognition result. Because not only the hot word text information but also its corresponding audio segments are used as input, and the speech data to be recognized and the hot word audio segments are fused by an attention operation before being used as input, the hot word excitation effect can be significantly improved; decoding the three then improves the hot word recognition effect and thereby the hot word recognition accuracy.
- the decoder includes two layers of one-way long and short-term memory layers, and the two layers of one-way long and short-term memory layers include a first layer of one-way long and short-term memory layers and a second layer of one-way long and short-term memory layers.
- the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence are input into the decoder for decoding operation to obtain the recognition result, which may include the following steps :
- the decoder may include multiple neural network layers.
- the decoder may include two unidirectional long short-term memory layers, namely a first unidirectional long short-term memory layer and a second unidirectional long short-term memory layer.
- the electronic device can obtain the recognition result of the first historical moment and the second feature vector sequence, third feature vector sequence and fourth feature vector sequence input to the decoder at the first historical moment, and input them into the first unidirectional long short-term memory layer to obtain the second state feature vector; it then inputs the second feature vector sequence, third feature vector sequence and fourth feature vector sequence input to the decoder at the current moment to the second unidirectional long short-term memory layer to obtain the recognition result at the current moment. Here the fourth feature vector sequence may be obtained by performing the first attention operation, using the second state feature vector, on at least one of the first feature vector sequence and the third feature vector sequence at the current moment; for example, the first attention operation can be performed on the first feature vector sequence and the third feature vector sequence at the current moment respectively, using the second state feature vector.
- the output of the second unidirectional long short-term memory layer of the decoder can be obtained, and a posterior probability calculation can be performed on that output to obtain the final decoding result, that is, the recognition result.
- the speech data to be recognized is encoded to obtain a first feature vector sequence; each hot word in the preset hot word library is encoded to obtain a second feature vector sequence; the audio segment of each hot word in the preset hot word library is encoded to obtain a third feature vector sequence; the first attention operation is performed on the first feature vector sequence and the third feature vector sequence to obtain a fourth feature vector sequence; and the decoding operation is performed according to the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence to obtain the recognition result. Because not only the hot word text information but also its corresponding audio segment is used as input, and the speech data to be recognized and the hot word audio segments are fused by the attention operation before being used as input, the hot word excitation effect can be significantly improved; performing the decoding operation on the three then improves the hot word recognition effect and thereby the hot word recognition accuracy.
- FIG. 2 is a schematic flowchart of a speech recognition method provided by an embodiment of the present application. As shown in the figure, the speech recognition method shown in FIG. 2 is applied to the speech recognition model shown in FIG. 1A, the speech recognition model is applied to an electronic device, and the speech recognition method includes:
- the input of the hot word text combined with the speech segment increases the richness of the hot word input information, which improves the hot word excitation effect;
- the use of double-layer excitation, that is, a second attention operation, further improves the hot word excitation effect.
- the two levels of the hot word excitation scheme complement each other and jointly improve the hot word recognition effect, which in turn helps to improve the hot word recognition accuracy.
- FIG. 3 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
- the electronic device includes a processor, a memory, a communication interface, and one or more programs, wherein the one or more programs are stored in the memory and are configured to be executed by the processor.
- the above-mentioned program includes instructions for executing the following steps:
- a decoding operation is performed according to the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence to obtain a recognition result.
- the electronic device described in the embodiment of the present application encodes the speech data to be recognized to obtain the first feature vector sequence; encodes each hot word in the preset hot word library to obtain the second feature vector sequence; encodes the audio segment of each hot word in the preset hot word library to obtain the third feature vector sequence; performs the first attention operation on the first feature vector sequence and the third feature vector sequence to obtain the fourth feature vector sequence; and performs the decoding operation according to the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence to obtain the recognition result. Because not only the hot word text information but also its corresponding audio segment is used as input, and the speech data to be recognized and the hot word audio segments are fused by the attention operation before being used as input, the hot word excitation effect can be significantly improved; performing the decoding operation on the three then improves the hot word recognition effect and thereby the hot word recognition accuracy.
- the above program includes instructions for performing the following steps:
- an operation is performed on the matching coefficient corresponding to each third feature vector and the corresponding third feature vector to obtain a new feature vector corresponding to each third feature vector;
- the new feature vector corresponding to each third feature vector is spliced with the corresponding first feature vector to obtain the characterization vector corresponding to each third feature vector, and the characterization vectors corresponding to the third feature vectors are taken as the fourth feature vector sequence.
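The frame-level (first) attention operation, which matches first feature vectors against third feature vectors, weights the third feature vectors by the resulting coefficients, and splices the result with the first feature vector, might be sketched as below. The text does not specify the matching function or the weighting operation, so this sketch assumes a dot product followed by a softmax for matching and a coefficient-weighted sum for the operation; the function name `frame_level_attention` is hypothetical.

```python
import math


def softmax(scores):
    """Normalise raw scores into coefficients that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def frame_level_attention(audio_frames, hotword_audio_vecs):
    """Return the fused (fourth) sequence, one vector per audio frame.

    audio_frames:       first feature vector sequence (list of vectors)
    hotword_audio_vecs: third feature vector sequence (list of vectors)
    Both are assumed to share the same vector dimension.
    """
    fused = []
    for frame in audio_frames:
        # Matching operation: one coefficient per hot word audio vector.
        coeffs = softmax([sum(a * b for a, b in zip(frame, hw))
                          for hw in hotword_audio_vecs])
        # Weight each hot word audio vector by its coefficient and sum.
        summary = [sum(c * hw[i] for c, hw in zip(coeffs, hotword_audio_vecs))
                   for i in range(len(frame))]
        # Splice (concatenate) the frame with its hot word summary.
        fused.append(frame + summary)
    return fused
```

Each fused vector is twice as long as an input frame: the first half is the original audio feature, the second half summarises the hot word audio most similar to that frame.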
- the above program includes instructions for performing the following steps:
- in terms of performing the second attention operation on the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence to obtain a hot word text context feature vector sequence, a hot word audio context feature vector sequence and an audio context feature vector sequence, the above program includes instructions for performing the following steps:
- the decoder includes a first unidirectional long short-term memory layer, and in terms of acquiring the first state feature vector of the decoder at the first historical moment, the above program includes instructions for performing the following steps:
- the hot word audio context feature vector sequence and the audio context feature vector sequence of the first historical moment are input to the first unidirectional long short-term memory layer to obtain the first state feature vector.
- the decoder includes a second unidirectional long short-term memory layer, and in terms of inputting the hot word text context feature vector sequence, the hot word audio context feature vector sequence and the audio context feature vector sequence to the decoder for a decoding operation to obtain the recognition result, the above program includes instructions for executing the following steps:
- the hot word text context feature vector sequence, the hot word audio context feature vector sequence and the audio context feature vector sequence input into the decoder at the current moment are input into the second unidirectional long short-term memory layer to obtain the recognition result of the current moment, where the hot word text context feature vector sequence, the hot word audio context feature vector sequence and the audio context feature vector sequence of the current moment are obtained by performing the second attention operation, using the first state feature vector, on the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence of the current moment, respectively.
- the above program includes instructions for performing the following steps:
- the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence are input to a decoder for decoding operation to obtain the recognition result.
- the decoder includes two unidirectional long short-term memory layers, namely a first unidirectional long short-term memory layer and a second unidirectional long short-term memory layer, and in terms of inputting the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence to the decoder for a decoding operation to obtain the recognition result, the above program includes instructions for executing the following steps:
- the recognition result of the first historical moment, together with the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence input to the decoder at the first historical moment, is input to the first unidirectional long short-term memory layer to obtain the second state feature vector;
- the fourth feature vector sequence is obtained by performing the first attention operation, using the second state feature vector, on the first feature vector sequence and the third feature vector sequence at the current moment.
- encoding the audio segment of each hot word in the preset hot word library to obtain a third feature vector sequence including:
- the audio segment of each hot word in the preset hot word library is encoded by one or more encoding layers to obtain the third feature vector sequence;
- the encoding layer includes a long short-term memory layer of a long short-term memory neural network or a convolutional layer of a convolutional neural network, where the long short-term memory layer is a long short-term memory layer of a unidirectional or bidirectional long short-term memory neural network.
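As an illustration of the convolutional option for the encoding layer, the following sketch applies a single 1-D convolution along the time axis of one hot word audio segment. The stride, the lack of padding, the ReLU activation and the function name `conv1d_encode` are all assumptions made for the example; the text only states that a convolutional layer (or a long short-term memory layer) may be used.

```python
def conv1d_encode(frames, kernels):
    """Apply one 1-D convolution layer (stride 1, no padding) over time.

    frames:  list of T acoustic feature vectors, each of length D
    kernels: list of C filters, each a list of K weight vectors of length D
    returns: list of T - K + 1 encoded vectors, each of length C
    """
    width = len(kernels[0])
    outputs = []
    for start in range(len(frames) - width + 1):
        window = frames[start:start + width]
        outputs.append([
            # ReLU over the sum of element-wise products in the window.
            max(0.0, sum(w * x
                         for taps, frame in zip(kernel, window)
                         for w, x in zip(taps, frame)))
            for kernel in kernels
        ])
    return outputs


def encode_hotword_audio(hotword_clips, kernels):
    """Encode the audio segment of every hot word in the library."""
    return [conv1d_encode(clip, kernels) for clip in hotword_clips]
```

Several such layers could be stacked by feeding the output of one call into the next, which corresponds to the "one or more encoding layers" wording.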
- the electronic device includes corresponding hardware structures and/or software modules for executing each function.
- the present application can be implemented in hardware or in the form of a combination of hardware and computer software, in combination with the units and algorithm steps of each example described in the embodiments provided herein. Whether a function is performed by hardware or computer software driving hardware depends on the specific application and design constraints of the technical solution. Skilled artisans may implement the described functionality using different methods for each particular application, but such implementations should not be considered beyond the scope of this application.
- the electronic device may be divided into functional units according to the foregoing method examples.
- each functional unit may be divided corresponding to each function, or two or more functions may be integrated into one processing unit.
- the above-mentioned integrated units may be implemented in the form of hardware, or may be implemented in the form of software functional units. It should be noted that the division of units in the embodiments of the present application is illustrative, and is only a logical function division, and other division methods may be used in actual implementation.
- FIG. 4A is a block diagram of functional units of the speech recognition apparatus 400 involved in the embodiment of the present application.
- the speech recognition apparatus 400 is applied to an electronic device, and the speech recognition apparatus 400 includes: an audio encoder module 401, a hot word text encoder module 402, a hot word audio encoder module 403, a frame-level attention module 404, and a decoder module 405, wherein:
- the audio encoder module 401 is used to encode the speech data to be recognized to obtain a first feature vector sequence
- the hot word text encoder module 402 is used to encode each hot word in the preset hot word database to obtain a second feature vector sequence
- the hot word audio encoder module 403 is configured to encode the audio segment of each hot word in the preset hot word library to obtain a third feature vector sequence
- the frame-level attention module 404 is configured to perform a first attention operation on the first feature vector sequence and the third feature vector sequence to obtain a fourth feature vector sequence;
- the decoder module 405 is configured to perform a decoding operation according to the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence to obtain a recognition result.
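The data flow between the five modules can be summarised in a short sketch. The class name `SpeechRecognizer`, its field names and the callable signatures are hypothetical; only the order of operations and the module numbering follow the description above.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Vector = List[float]
VectorSeq = List[Vector]


@dataclass
class SpeechRecognizer:
    audio_encoder: Callable[[Sequence[float]], VectorSeq]            # module 401
    hotword_text_encoder: Callable[[List[str]], VectorSeq]           # module 402
    hotword_audio_encoder: Callable[[List[Sequence[float]]], VectorSeq]  # module 403
    frame_attention: Callable[[VectorSeq, VectorSeq], VectorSeq]     # module 404
    decoder: Callable[[VectorSeq, VectorSeq, VectorSeq], str]        # module 405

    def recognize(self, speech, hotwords, hotword_clips):
        first = self.audio_encoder(speech)                 # first sequence
        second = self.hotword_text_encoder(hotwords)       # second sequence
        third = self.hotword_audio_encoder(hotword_clips)  # third sequence
        fourth = self.frame_attention(first, third)        # fourth sequence
        # The decoder sees hot word text, hot word audio and fused audio.
        return self.decoder(second, third, fourth)
```

Keeping the modules as injected callables makes it straightforward to swap, for example, a long short-term memory encoder for a convolutional one without touching the wiring.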
- the speech recognition device described in the embodiment of the present application encodes the speech data to be recognized to obtain the first feature vector sequence; encodes each hot word in the preset hot word library to obtain the second feature vector sequence; encodes the audio segment of each hot word in the preset hot word library to obtain the third feature vector sequence; performs the first attention operation on the first feature vector sequence and the third feature vector sequence to obtain the fourth feature vector sequence; and performs the decoding operation according to the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence to obtain the recognition result. Because not only the hot word text information but also its corresponding audio segment is used as input, and the speech data to be recognized and the hot word audio segments are fused by the attention operation before being used as input, the hot word excitation effect can be significantly improved; performing the decoding operation on the three then improves the hot word recognition effect and thereby the hot word recognition accuracy.
- the frame-level attention module 404 is specifically used for:
- performing an operation on the matching coefficient corresponding to each third feature vector and the corresponding third feature vector to obtain a new feature vector corresponding to each third feature vector;
- splicing the new feature vector corresponding to each third feature vector with the corresponding first feature vector to obtain the characterization vector corresponding to each third feature vector, and taking the characterization vectors corresponding to the third feature vectors as the fourth feature vector sequence.
- its decoder module 405 may include: a word-level attention module 4051 and decoder 4052, where,
- the word-level attention module 4051 is configured to perform a second attention operation on the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence to obtain a hot word text context feature vector sequence, a hot word audio context feature vector sequence and an audio context feature vector sequence;
- the decoder 4052 is configured to input the hot word text context feature vector sequence, the hot word audio context feature vector sequence and the audio context feature vector sequence to the decoder for decoding operation to obtain a recognition result.
- in terms of performing the second attention operation on the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence to obtain a hot word text context feature vector sequence, a hot word audio context feature vector sequence and an audio context feature vector sequence, the word-level attention module 4051 is specifically used for:
- the decoder includes a first unidirectional long short-term memory layer, and in terms of acquiring the first state feature vector of the decoder at the first historical moment, the word-level attention module 4051 is specifically used for:
- inputting the hot word audio context feature vector sequence and the audio context feature vector sequence of the first historical moment to the first unidirectional long short-term memory layer to obtain the first state feature vector.
- the decoder includes a second unidirectional long short-term memory layer, and in terms of inputting the hot word text context feature vector sequence, the hot word audio context feature vector sequence and the audio context feature vector sequence to the decoder for a decoding operation to obtain the recognition result, the decoder 4052 is specifically used for:
- inputting the hot word text context feature vector sequence, the hot word audio context feature vector sequence and the audio context feature vector sequence input into the decoder at the current moment into the second unidirectional long short-term memory layer to obtain the recognition result of the current moment, where the hot word text context feature vector sequence, the hot word audio context feature vector sequence and the audio context feature vector sequence of the current moment are obtained by performing the second attention operation, using the first state feature vector, on the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence of the current moment, respectively.
- the decoder module 405 is specifically used for:
- the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence are input to a decoder for decoding operation to obtain the recognition result.
- the decoder includes two unidirectional long short-term memory layers, namely a first unidirectional long short-term memory layer and a second unidirectional long short-term memory layer, and in terms of inputting the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence to the decoder for a decoding operation to obtain the recognition result, the decoder module 405 is specifically used for:
- inputting the recognition result of the first historical moment, together with the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence input to the decoder at the first historical moment, to the first unidirectional long short-term memory layer to obtain the second state feature vector;
- the fourth feature vector sequence is obtained by performing the first attention operation, using the second state feature vector, on the first feature vector sequence and the third feature vector sequence at the current moment.
- the hot word audio encoder module 403 is specifically configured to:
- encoding the audio segment of each hot word in the preset hot word library by one or more encoding layers to obtain the third feature vector sequence;
- the encoding layer includes a long short-term memory layer of a long short-term memory neural network or a convolutional layer of a convolutional neural network, where the long short-term memory layer is a long short-term memory layer of a unidirectional or bidirectional long short-term memory neural network.
- Embodiments of the present application further provide a computer storage medium, wherein the computer storage medium stores a computer program for electronic data exchange, and the computer program causes a computer to execute some or all of the steps of any method described in the above method embodiments; the computer includes an electronic device.
- Embodiments of the present application further provide a computer program product, where the computer program product includes a non-transitory computer-readable storage medium storing a computer program, and the computer program is operable to cause a computer to execute some or all of the steps of any method described in the above method embodiments.
- the computer program product may be a software installation package, and the computer includes an electronic device.
- the disclosed apparatus may be implemented in other manners.
- the device embodiments described above are only illustrative.
- the division of the above-mentioned units is only a logical function division.
- multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
- the shown or discussed mutual coupling or direct coupling or communication connection may be through some interfaces, indirect coupling or communication connection of devices or units, and may be in electrical or other forms.
- the units described above as separate components may or may not be physically separated, and components shown as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
- each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
- the above-mentioned integrated units may be implemented in the form of hardware, or may be implemented in the form of software functional units.
- the above-mentioned integrated units if implemented in the form of software functional units and sold or used as independent products, may be stored in a computer-readable memory.
- the technical solution of the present application can be embodied in the form of a software product in essence, or the part that contributes to the prior art, or all or part of the technical solution, and the computer software product is stored in a memory.
- a computer device which may be a personal computer, a server, or a network device, etc.
- the aforementioned memory includes: U disk, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), mobile hard disk, magnetic disk or optical disk and other media that can store program codes.
Landscapes
- Engineering & Computer Science (AREA)
- Human Computer Interaction (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Signal Processing (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Machine Translation (AREA)
- Compression, Expansion, Code Conversion, And Decoders (AREA)
Claims (21)
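- A speech recognition method, characterized in that the method comprises: encoding speech data to be recognized to obtain a first feature vector sequence; encoding each hot word in a preset hot word library to obtain a second feature vector sequence; encoding an audio segment of each hot word in the preset hot word library to obtain a third feature vector sequence; performing a first attention operation on the first feature vector sequence and the third feature vector sequence to obtain a fourth feature vector sequence; and performing a decoding operation according to the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence to obtain a recognition result.
- The method according to claim 1, characterized in that performing the first attention operation on the first feature vector sequence and the third feature vector sequence to obtain the fourth feature vector sequence comprises: performing a matching operation between each first feature vector in the first feature vector sequence and each third feature vector in the third feature vector sequence to obtain a matching coefficient corresponding to each third feature vector; performing an operation on the matching coefficient corresponding to each third feature vector and the corresponding third feature vector to obtain a new feature vector corresponding to each third feature vector; and splicing the new feature vector corresponding to each third feature vector with the corresponding first feature vector to obtain a characterization vector corresponding to each third feature vector, and taking the characterization vectors corresponding to the third feature vectors as the fourth feature vector sequence.
- The method according to claim 1 or 2, characterized in that performing the decoding operation according to the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence to obtain the recognition result comprises: performing a second attention operation on the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence to obtain a hot word text context feature vector sequence, a hot word audio context feature vector sequence and an audio context feature vector sequence; and inputting the hot word text context feature vector sequence, the hot word audio context feature vector sequence and the audio context feature vector sequence to a decoder for a decoding operation to obtain the recognition result.
- The method according to claim 3, characterized in that performing the second attention operation on the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence to obtain the hot word text context feature vector sequence, the hot word audio context feature vector sequence and the audio context feature vector sequence comprises: acquiring a first state feature vector of the decoder at the current moment; performing an attention operation on the second feature vector sequence according to the first state feature vector to obtain the hot word text context feature vector sequence of the current moment; performing an attention operation on the third feature vector sequence according to the first state feature vector to obtain the hot word audio context feature vector sequence of the current moment; and performing an attention operation on the fourth feature vector sequence according to the first state feature vector to obtain the audio context feature vector sequence of the current moment.
- The method according to claim 4, characterized in that the decoder comprises a first-layer unidirectional long short-term memory layer, and acquiring the first state feature vector of the decoder at the first historical moment comprises: acquiring a recognition result of the first historical moment and the hot word text context feature vector sequence, the hot word audio context feature vector sequence and the audio context feature vector sequence of the first historical moment; and inputting the recognition result of the first historical moment and the hot word text context feature vector sequence, the hot word audio context feature vector sequence and the audio context feature vector sequence of the first historical moment to the first-layer unidirectional long short-term memory layer to obtain the first state feature vector.
- The method according to claim 3, characterized in that the decoder comprises a second-layer unidirectional long short-term memory layer, and inputting the hot word text context feature vector sequence, the hot word audio context feature vector sequence and the audio context feature vector sequence to the decoder for the decoding operation to obtain the recognition result comprises: inputting the hot word text context feature vector sequence, the hot word audio context feature vector sequence and the audio context feature vector sequence that are input to the decoder at the current moment to the second-layer unidirectional long short-term memory layer to obtain the recognition result of the current moment, wherein the hot word text context feature vector sequence, the hot word audio context feature vector sequence and the audio context feature vector sequence of the current moment are obtained by performing the second attention operation, by means of a first state feature vector, on the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence of the current moment, respectively.
- The method according to claim 1, characterized in that performing the decoding operation according to the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence to obtain the recognition result comprises: inputting the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence to a decoder for a decoding operation to obtain the recognition result.
- The method according to claim 7, characterized in that the decoder comprises two unidirectional long short-term memory layers, the two unidirectional long short-term memory layers comprising a first-layer unidirectional long short-term memory layer and a second-layer unidirectional long short-term memory layer, and inputting the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence to the decoder for the decoding operation to obtain the recognition result comprises: acquiring a recognition result of a first historical moment and the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence input to the decoder at the first historical moment; inputting the recognition result of the first historical moment and the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence input to the decoder at the first historical moment to the first-layer unidirectional long short-term memory layer to obtain a second state feature vector; and inputting the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence that are input to the decoder at the current moment to the second-layer unidirectional long short-term memory layer to obtain the recognition result of the current moment, wherein the fourth feature vector sequence is obtained by performing the first attention operation, by means of the second state feature vector, on the first feature vector sequence and the third feature vector sequence of the current moment.
- The method according to any one of claims 1-8, characterized in that encoding the audio segment of each hot word in the preset hot word library to obtain the third feature vector sequence comprises: encoding the audio segment of each hot word in the preset hot word library by one or more encoding layers to obtain the third feature vector sequence, wherein the encoding layer comprises a long short-term memory layer of a long short-term memory neural network or a convolutional layer of a convolutional neural network, and the long short-term memory layer is a long short-term memory layer of a unidirectional or bidirectional long short-term memory neural network.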
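- A speech recognition apparatus, characterized in that the speech recognition apparatus comprises: an audio encoder module, a hot word text encoder module, a hot word audio encoder module, a frame-level attention module and a decoder module, wherein the audio encoder module is configured to encode speech data to be recognized to obtain a first feature vector sequence; the hot word text encoder module is configured to encode each hot word in a preset hot word library to obtain a second feature vector sequence; the hot word audio encoder module is configured to encode an audio segment of each hot word in the preset hot word library to obtain a third feature vector sequence; the frame-level attention module is configured to perform a first attention operation on the first feature vector sequence and the third feature vector sequence to obtain a fourth feature vector sequence; and the decoder module is configured to perform a decoding operation according to the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence to obtain a recognition result.
- The apparatus according to claim 10, characterized in that, in terms of performing the first attention operation on the first feature vector sequence and the third feature vector sequence to obtain the fourth feature vector sequence, the frame-level attention module is specifically configured to: perform a matching operation between each first feature vector in the first feature vector sequence and each third feature vector in the third feature vector sequence to obtain a matching coefficient corresponding to each third feature vector; perform an operation on the matching coefficient corresponding to each third feature vector and the corresponding third feature vector to obtain a new feature vector corresponding to each third feature vector; and splice the new feature vector corresponding to each third feature vector with the corresponding first feature vector to obtain a characterization vector corresponding to each third feature vector, and take the characterization vectors corresponding to the third feature vectors as the fourth feature vector sequence.
- The apparatus according to claim 10 or 11, characterized in that, in terms of performing the decoding operation according to the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence to obtain the recognition result, the encoder module is specifically configured to: perform a second attention operation on the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence to obtain a hot word text context feature vector sequence, a hot word audio context feature vector sequence and an audio context feature vector sequence; and input the hot word text context feature vector sequence, the hot word audio context feature vector sequence and the audio context feature vector sequence to a decoder for a decoding operation to obtain the recognition result.
- The apparatus according to claim 12, characterized in that, in terms of performing the second attention operation on the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence to obtain the hot word text context feature vector sequence, the hot word audio context feature vector sequence and the audio context feature vector sequence, the encoder module is specifically configured to: acquire a first state feature vector of the decoder at the current moment; perform an attention operation on the second feature vector sequence according to the first state feature vector to obtain the hot word text context feature vector sequence; perform an attention operation on the third feature vector sequence according to the first state feature vector to obtain the hot word audio context feature vector sequence; and perform an attention operation on the fourth feature vector sequence according to the first state feature vector to obtain the audio context feature vector sequence.
- The apparatus according to claim 13, characterized in that the decoder comprises a first-layer unidirectional long short-term memory layer, and in terms of acquiring the first state feature vector of the decoder at the first historical moment, the encoder module is specifically configured to: acquire a recognition result of the first historical moment and the hot word text context feature vector sequence, the hot word audio context feature vector sequence and the audio context feature vector sequence of the first historical moment; and input the recognition result of the first historical moment and the hot word text context feature vector sequence, the hot word audio context feature vector sequence and the audio context feature vector sequence of the first historical moment to the first-layer unidirectional long short-term memory layer to obtain the first state feature vector.
- The apparatus according to claim 10, characterized in that the decoder comprises a second-layer unidirectional long short-term memory layer, and in terms of inputting the hot word text context feature vector sequence, the hot word audio context feature vector sequence and the audio context feature vector sequence to the decoder for the decoding operation to obtain the recognition result, the encoder module is specifically configured to: input the hot word text context feature vector sequence, the hot word audio context feature vector sequence and the audio context feature vector sequence that are input to the decoder at the current moment to the second-layer unidirectional long short-term memory layer to obtain the recognition result of the current moment, wherein the hot word text context feature vector sequence, the hot word audio context feature vector sequence and the audio context feature vector sequence of the current moment are obtained by performing the second attention operation, by means of the first state feature vector, on the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence of the current moment, respectively.
- The apparatus according to claim 10, characterized in that, in terms of performing the decoding operation according to the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence to obtain the recognition result, the encoder module is specifically configured to: input the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence to a decoder for a decoding operation to obtain the recognition result.
- The apparatus according to claim 16, characterized in that the decoder comprises two unidirectional long short-term memory layers, the two unidirectional long short-term memory layers comprising a first-layer unidirectional long short-term memory layer and a second-layer unidirectional long short-term memory layer, and in terms of inputting the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence to the decoder for the decoding operation to obtain the recognition result, the encoder module is specifically configured to: acquire a recognition result of a first historical moment and the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence input to the decoder at the first historical moment; input the recognition result of the first historical moment and the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence input to the decoder at the first historical moment to the first-layer unidirectional long short-term memory layer to obtain a second state feature vector; and input the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence that are input to the decoder at the current moment to the second-layer unidirectional long short-term memory layer to obtain the recognition result of the current moment, wherein the fourth feature vector sequence is obtained by performing the first attention operation, by means of the second state feature vector, on the first feature vector sequence and the third feature vector sequence of the current moment.
- The apparatus according to any one of claims 10-17, characterized in that, in terms of encoding the audio segment of each hot word in the preset hot word library to obtain the third feature vector sequence, the hot word audio encoder module is specifically configured to: encode the audio segment of each hot word in the preset hot word library by one or more encoding layers to obtain the third feature vector sequence, wherein the encoding layer comprises a long short-term memory layer of a long short-term memory neural network or a convolutional layer of a convolutional neural network, and the long short-term memory layer is a long short-term memory layer of a unidirectional or bidirectional long short-term memory neural network.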
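- An electronic device, comprising: a processor, a memory, a communication interface, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the processor, and comprise instructions for performing the steps of the method according to any one of claims 1 to 9.
- A computer-readable storage medium storing a computer program for electronic data interchange, wherein the computer program causes a computer to perform the method according to any one of claims 1 to 9.
- A computer program product, comprising a non-transitory computer-readable storage medium storing a computer program, the computer program being operable to cause a computer to perform the method according to any one of claims 1 to 9.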
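
The two attention operations recited in the claims can be illustrated with a minimal NumPy sketch. The claims only speak of a "matching operation" and an "operation" on the matching coefficients; the dot-product scoring, the softmax normalisation, the use of one vector per hot word, and the projection matrix `proj` are all assumptions made here for illustration, as are the function names. This is not the applicant's implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def frame_level_attention(first_seq, third_seq):
    """First attention operation (sketch).

    first_seq: (T, d) audio frame features (first feature vector sequence).
    third_seq: (N, d) hot word audio features (third feature vector sequence),
               assumed here to be one vector per hot word.
    Returns a (T, 2d) array standing in for the fourth feature vector sequence.
    """
    # Matching coefficients between every frame and every hot word vector
    # (assumed: dot product followed by softmax over the hot words).
    coeffs = softmax(first_seq @ third_seq.T, axis=1)        # (T, N)
    # Weight the hot word vectors by their coefficients to form new vectors.
    new_vectors = coeffs @ third_seq                          # (T, d)
    # Concatenate each new vector with the corresponding frame vector.
    return np.concatenate([first_seq, new_vectors], axis=1)  # (T, 2d)


def state_attention(state, seq, proj):
    """Second attention operation (sketch).

    state: (d_s,) decoder state feature vector.
    seq:   (L, d_z) sequence to attend over.
    proj:  (d_s, d_z) hypothetical projection mapping the state to a query.
    Returns a (d_z,) context feature vector.
    """
    query = state @ proj              # (d_z,)
    weights = softmax(seq @ query)    # (L,)
    return weights @ seq              # (d_z,)
```

In this sketch, `state_attention` would be called three times per decoding step with the same decoder state: once each on the second, third, and fourth feature vector sequences, yielding the hot word text context, hot word audio context, and audio context vectors that the claims feed to the decoder.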
Priority Applications (3)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| JP2023540012A JP7627766B2 (ja) | 2020-12-31 | 2021-01-26 | Speech recognition method, apparatus, and storage medium |
| EP21912486.4A EP4273855B1 (en) | 2020-12-31 | 2021-01-26 | Speech recognition method and apparatus, and storage medium |
| KR1020237026093A KR20230159371A (ko) | 2020-12-31 | 2021-01-26 | Speech recognition method and apparatus, and storage medium |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202011641751.3A CN112767917B (zh) | 2020-12-31 | 2020-12-31 | Speech recognition method, apparatus, and storage medium |
| CN202011641751.3 | 2020-12-31 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2022141706A1 (zh) | 2022-07-07 |
Family
ID=75698522
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/CN2021/073773 Ceased WO2022141706A1 (zh) | 2020-12-31 | 2021-01-26 | Speech recognition method, apparatus, and storage medium |
Country Status (5)
| Country | Link |
|---|---|
| EP (1) | EP4273855B1 (zh) |
| JP (1) | JP7627766B2 (zh) |
| KR (1) | KR20230159371A (zh) |
| CN (1) | CN112767917B (zh) |
| WO (1) | WO2022141706A1 (zh) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN117116264A (zh) * | 2023-02-20 | 2023-11-24 | 荣耀终端有限公司 | Speech recognition method, electronic device, and medium |
Families Citing this family (8)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
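| CN113112995B (zh) * | 2021-05-28 | 2022-08-05 | 思必驰科技股份有限公司 | Word acoustic feature system, and training method and system for the word acoustic feature system |
| CN113436614B (zh) * | 2021-07-02 | 2024-02-13 | 中国科学技术大学 | Speech recognition method, apparatus, device, system, and storage medium |
| CN113488052B (zh) * | 2021-07-22 | 2022-09-02 | 深圳鑫思威科技有限公司 | Mutual control method for wireless voice transmission and AI speech recognition |
| CN113782007B (zh) * | 2021-09-07 | 2024-08-16 | 上海企创信息科技有限公司 | Speech recognition method and apparatus, speech recognition device, and storage medium |
| CN114155849A (zh) * | 2021-11-04 | 2022-03-08 | 北京搜狗科技发展有限公司 | Virtual object processing method, apparatus, and medium |
| CN114360516B (zh) * | 2021-12-10 | 2025-08-29 | 广州小鹏汽车科技有限公司 | Speech recognition method, server, speech recognition system, and storage medium |
| CN114333791A (zh) * | 2021-12-10 | 2022-04-12 | 广州小鹏汽车科技有限公司 | Speech recognition method, server, speech recognition system, and readable storage medium |
| CN119724171B (zh) * | 2024-10-24 | 2025-10-10 | 平安科技(深圳)有限公司 | Vocabulary recognition method and apparatus based on a speech model, electronic device, and medium |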
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US10255922B1 (en) * | 2013-07-18 | 2019-04-09 | Google Llc | Speaker identification using a text-independent model and a text-dependent model |
| CN110214351A (zh) * | 2017-06-05 | 2019-09-06 | 谷歌有限责任公司 | Recorded media hotword trigger suppression |
| CN111583909A (zh) * | 2020-05-18 | 2020-08-25 | 科大讯飞股份有限公司 | Speech recognition method, apparatus, device, and storage medium |
| CN111783466A (zh) * | 2020-07-15 | 2020-10-16 | 电子科技大学 | Named entity recognition method for Chinese medical records |
Family Cites Families (9)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US11145293B2 (en) * | 2018-07-20 | 2021-10-12 | Google Llc | Speech recognition with sequence-to-sequence models |
| US11295739B2 (en) | 2018-08-23 | 2022-04-05 | Google Llc | Key phrase spotting |
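| CN110162749B (zh) * | 2018-10-22 | 2023-07-21 | 哈尔滨工业大学(深圳) | Information extraction method and apparatus, computer device, and computer-readable storage medium |
| CN109829172B (zh) * | 2019-01-04 | 2023-07-04 | 北京先声智能科技有限公司 | Bidirectional-decoding automatic grammar error correction model based on neural translation |
| JP7234415B2 (ja) * | 2019-05-06 | 2023-03-07 | グーグル エルエルシー | Contextual biasing for speech recognition |
| CN110648658B (zh) * | 2019-09-06 | 2022-04-08 | 北京达佳互联信息技术有限公司 | Method and apparatus for generating a speech recognition model, and electronic device |
| CN111009237B (zh) | 2019-12-12 | 2022-07-01 | 北京达佳互联信息技术有限公司 | Speech recognition method and apparatus, electronic device, and storage medium |
| CN111199727B (zh) * | 2020-01-09 | 2022-12-06 | 厦门快商通科技股份有限公司 | Speech recognition model training method and system, mobile terminal, and storage medium |
| CN111933115B (zh) * | 2020-10-12 | 2021-02-09 | 腾讯科技(深圳)有限公司 | Speech recognition method, apparatus, device, and storage medium |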
- 2020-12-31: CN, application CN202011641751.3A, patent CN112767917B (zh), status Active
- 2021-01-26: KR, application KR1020237026093A, publication KR20230159371A (ko), status Pending
- 2021-01-26: EP, application EP21912486.4A, patent EP4273855B1 (en), status Active
- 2021-01-26: JP, application JP2023540012A, patent JP7627766B2 (ja), status Active
- 2021-01-26: WO, application PCT/CN2021/073773, publication WO2022141706A1 (zh), status Ceased
Non-Patent Citations (1)
| Title |
|---|
| See also references of EP4273855A4 |
Also Published As
| Publication number | Publication date |
|---|---|
| EP4273855A4 (en) | 2024-10-23 |
| EP4273855B1 (en) | 2026-03-25 |
| EP4273855C0 (en) | 2026-03-25 |
| JP7627766B2 (ja) | 2025-02-06 |
| EP4273855A1 (en) | 2023-11-08 |
| KR20230159371A (ko) | 2023-11-21 |
| CN112767917B (zh) | 2022-05-17 |
| CN112767917A (zh) | 2021-05-07 |
| JP2024502048A (ja) | 2024-01-17 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
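| WO2022141706A1 (zh) | | Speech recognition method, apparatus, and storage medium |
| JP7407968B2 (ja) | | Speech recognition method, apparatus, device, and storage medium |
| CN113889076B (zh) | | Speech recognition and encoding/decoding method and apparatus, electronic device, and storage medium |
| CN110956959A (zh) | | Speech recognition error correction method, related device, and readable storage medium |
| WO2019169996A1 (zh) | | Video processing and video retrieval method and apparatus, storage medium, and server |
| CN110069612B (zh) | | Reply generation method and apparatus |
| CN108959388B (zh) | | Information generation method and apparatus |
| CN113239157B (zh) | | Dialogue model training method, apparatus, device, and storage medium |
| CN112802444A (zh) | | Speech synthesis method, apparatus, device, and storage medium |
| CN116825084A (zh) | | Cross-lingual speech synthesis method and apparatus, electronic device, and storage medium |
| CN114373443A (zh) | | Speech synthesis method and apparatus, computing device, storage medium, and program product |
| CN113793591A (zh) | | Speech synthesis method and related apparatus, electronic device, and storage medium |
| CN115393849A (zh) | | Data processing method and apparatus, electronic device, and storage medium |
| CN117877460A (zh) | | Speech synthesis method and apparatus, and speech synthesis model training method and apparatus |
| JP7765622B2 (ja) | | Fusion of acoustic and text representations in an automatic speech recognition system implemented as an RNN-T |
| CN111930900A (zh) | | Standard pronunciation generation method and related apparatus |
| CN116343781A (zh) | | Training method and apparatus for a speech recognition model, storage medium, and electronic device |
| CN114495914B (zh) | | Speech recognition method, training method for a speech recognition model, and related apparatus |
| CN109979461A (zh) | | Speech translation method and apparatus |
| CN115457938A (zh) | | Method and apparatus for recognizing a wake-up word, storage medium, and electronic apparatus |
| CN115376496A (zh) | | Speech recognition method and apparatus, computer device, and storage medium |
| CN111477212B (zh) | | Content recognition, model training, and data processing methods, systems, and devices |
| CN113793598A (zh) | | Training method and data augmentation method for a speech processing model, apparatus, and device |
| CN114882880B (zh) | | Decoder-based voice wake-up method and related device |
| CN114974235B (zh) | | Voice instruction recognition method and apparatus, and electronic device |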
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 21912486; Country of ref document: EP; Kind code of ref document: A1 |
| | WWE | WIPO information: entry into national phase | Ref document number: 2023540012; Country of ref document: JP |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | ENP | Entry into the national phase | Ref document number: 2021912486; Country of ref document: EP; Effective date: 20230731 |
| | WWG | WIPO information: grant in national office | Ref document number: 2021912486; Country of ref document: EP |
