WO2022141706A1 - Speech recognition method, device and storage medium - Google Patents

Speech recognition method, device and storage medium

Info

Publication number
WO2022141706A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature vector
vector sequence
hot word
sequence
audio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2021/073773
Other languages
English (en)
French (fr)
Inventor
方昕
吴明辉
马志强
刘俊华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd
Priority to JP2023540012A (JP7627766B2, ja)
Priority to EP21912486.4A (EP4273855B1, en)
Priority to KR1020237026093A (KR20230159371A, ko)
Publication of WO2022141706A1 (zh)

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L2015/088 Word spotting

Definitions

  • the present application relates to the technical field of speech recognition, and in particular, to a speech recognition method, device and storage medium.
  • Embodiments of the present application provide a speech recognition method, device, and storage medium, which can improve the accuracy of hot word recognition.
  • an embodiment of the present application provides a speech recognition method, the method comprising:
  • a decoding operation is performed according to the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence to obtain a recognition result.
  • an embodiment of the present application provides a speech recognition device, the speech recognition device includes: an audio encoder module, a hot word text encoder module, a hot word audio encoder module, a frame-level attention module, and a decoder module, wherein:
  • the audio encoder module is used to encode the speech data to be recognized to obtain the first feature vector sequence
  • the hot word text encoder module is used to encode each hot word in the preset hot word library to obtain a second feature vector sequence
  • the hot word audio encoder module is used to encode the audio segment of each hot word in the preset hot word library to obtain a third feature vector sequence
  • the frame-level attention module is configured to perform a first attention operation on the first feature vector sequence and the third feature vector sequence to obtain a fourth feature vector sequence;
  • the decoder module is configured to perform a decoding operation according to the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence to obtain a recognition result.
  • an embodiment of the present application provides an electronic device, including: a processor, a memory, a communication interface, and one or more programs; wherein the one or more programs are stored in the memory and are configured to be executed by the processor, and the programs include instructions for executing steps in any method of the first aspect of the embodiments of the present application.
  • an embodiment of the present application provides a computer-readable storage medium, wherein the computer-readable storage medium stores a computer program for electronic data exchange, and the computer program causes a computer to execute some or all of the steps described in the first aspect of the embodiments of the present application.
  • an embodiment of the present application provides a computer program product, wherein the computer program product includes a non-transitory computer-readable storage medium storing a computer program, and the computer program is operable to cause a computer to execute some or all of the steps described in the first aspect of the embodiments of the present application.
  • the computer program product may be a software installation package.
  • the speech recognition method, device and related products described in the embodiments of the present application encode the speech data to be recognized to obtain the first feature vector sequence; encode each hot word in the preset hot word library to obtain the second feature vector sequence; encode the audio segment of each hot word in the preset hot word library to obtain the third feature vector sequence; perform the first attention operation on the first feature vector sequence and the third feature vector sequence to obtain the fourth feature vector sequence; and perform the decoding operation according to the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence to obtain the recognition result. Both the hot word text information and the corresponding audio segment are used as input.
  • FIG. 1A is a schematic structural diagram of a speech recognition model provided by an embodiment of the present application.
  • FIG. 1B is a schematic flowchart of a speech recognition method provided by an embodiment of the present application.
  • FIG. 1C is a schematic diagram of a demonstration of hot word encoding provided by an embodiment of the present application.
  • FIG. 1D is a schematic diagram of a demonstration of feature splicing provided by an embodiment of the present application.
  • FIG. 2 is a schematic flowchart of another speech recognition method provided by an embodiment of the present application.
  • FIG. 3 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • FIG. 4A is a block diagram of functional units of a speech recognition device provided by an embodiment of the present application.
  • FIG. 4B is a block diagram of functional units of another speech recognition apparatus provided by an embodiment of the present application.
  • the electronic devices involved in the embodiments of the present application may include various handheld devices with speech recognition functions, voice recorders, smart robots, smart readers, smart translators, smart headphones, smart dictionaries, smart point readers, in-vehicle devices, wearable devices, computing devices or other processing devices connected to wireless modems, as well as various forms of user equipment (UE), mobile stations (MS), terminal devices, etc. The electronic device may also be a server or a smart home device.
  • the smart home device may be at least one of the following: refrigerators, washing machines, rice cookers, smart curtains, smart lights, smart beds, smart trash cans, microwave ovens, ovens, steamers, air conditioners, range hoods, servers, smart doors, smart windows, window and door wardrobes, smart speakers, smart homes, smart chairs, smart drying racks, smart showers, water dispensers, water purifiers, air purifiers, doorbells, monitoring systems, smart garages, TVs, projectors, smart dining tables, smart sofas, massage chairs, treadmills, etc.
  • FIG. 1A shows a speech recognition model provided by an embodiment of the present application.
  • the speech recognition model includes: an audio encoder module, a hot word text encoder module, a hot word audio encoder module, a frame-level attention module, a word-level attention module and a decoder module. The decoder module may include a decoder. The speech recognition model can be used to implement the speech recognition function, as follows:
  • the electronic device can also use the hot word text encoder module to independently encode each hot word in the preset hot word database, so as to encode the hot words of different lengths into fixed-dimensional vectors, and obtain a set of features representing the hot words.
  • the preset hot word library can be set in advance according to the user's needs. For example, hot words matching the user's identity or occupation can be drawn from a pre-established basic hot word library.
  • the preset hot word library can also be generated automatically from the user's usage history. For example, hot words that appear while the user is using the device can be collected automatically into the preset hot word library. For another example, in a voice assistant scenario, the names in the user's address book can be read as hot words, and a preset hot word library can be generated from these hot words. For another example, in an input method, after legal authorization, entity texts that the user types in pinyin, such as place names and personal names, can be remembered as hot words and compiled into a preset hot word library.
  • the preset hot word library can be saved locally or in the cloud.
  • the audio segment of each hot word in the preset hot word library can be independently encoded by the hot word audio encoder module. The hot word audio encoder module can be shared with the aforementioned audio encoder module, that is, the two can be the same encoder. Hot word audio clips of different lengths can thus also be encoded into fixed-dimensional vectors: either the encoder output at the last frame of the hot word audio clip or the average of the encoder outputs over all frames of the clip can be used.
  • the frame-level attention module performs a frame-level attention operation on the audio coding representation of each frame (the first feature vector sequence) and the hot word audio coding representation (the third feature vector sequence), fusing the hot word information to form a new audio coding representation, the fourth feature vector sequence.
  • two ways can be used to perform the decoding operation, as follows:
  • in the first way, the word-level attention module takes as input the state vector d_t output by the decoder module at time t, the fourth feature vector sequence output by the frame-level attention module, the second feature vector sequence H^z output by the hot word text encoder module and the third feature vector sequence H^w output by the hot word audio encoder module. It uses the attention mechanism to calculate, for predicting the t-th character, the audio context feature vector c_t^x, the hot word audio context feature vector c_t^w and the hot word text context feature vector c_t^z, which are input into the decoder module to complete decoding.
  • in the second way, the fourth feature vector sequence output by the frame-level attention module, the second feature vector sequence H^z output by the hot word text encoder module and the third feature vector sequence H^w output by the hot word audio encoder module are input directly into the decoder to complete decoding.
  • the hot word excitation effect can be significantly improved, and then the three are decoded to improve the hot word recognition effect, thereby improving the hot word recognition accuracy.
  • FIG. 1B is a schematic flowchart of a speech recognition method provided by an embodiment of the present application. As shown in the figure, the speech recognition method shown in FIG. 1B is applied to the speech recognition model shown in FIG. 1A, the speech recognition model is applied to an electronic device, and the speech recognition method includes:
  • the voice data to be recognized may be pre-stored or real-time collected voice data or a sequence of voice feature vectors. The voice data may be at least one of the following: recorded data, data recorded in real time, recording data extracted from video data, synthesized recording data, etc., which are not limited here.
  • the speech feature vector sequence may be at least one of the following: Filter Bank feature, Mel Frequency Cepstrum Coefficient (MFCC) feature, Perceptual Linear Predictive (PLP) feature, etc., which are not limited here.
  • the electronic device can perform feature extraction on the voice data to obtain a sequence of voice feature vectors, and then encode the sequence of voice feature vectors to obtain a first sequence of feature vectors.
  • the electronic device can directly encode the sequence of speech feature vectors to obtain a first sequence of feature vectors.
  • the electronic device can encode the speech data to be recognized through an audio encoder module to obtain a first feature vector sequence
  • the audio encoder module can be composed of one or more encoding layers. An encoding layer can be a long short-term memory layer of a long short-term memory (LSTM) neural network or a convolutional layer of a convolutional neural network, and the long short-term memory layer can be unidirectional or bidirectional.
  • the preset hot word database may be stored in the electronic device in advance, and the preset hot word database may include text information of multiple hot words.
  • the electronic device can encode each hot word in the preset hot word database through the hot word text encoder module to obtain the second feature vector sequence.
  • the preset hot word library may also be pre-stored on other servers, and the preset hot word library may be obtained by accessing.
  • the number of words contained in different hot words can be the same or different. If the numbers differ, for example, the Japanese hot word "Tokyo" has two characters and "Kanagawa" has three characters, the variable-length input can be represented by a fixed-dimensional vector to facilitate model processing.
  • the function of the hot word text encoder module is to encode hot words of different lengths into fixed-dimensional vectors. It can include one or more encoding layers, each of which can be a long short-term memory layer of a long short-term memory neural network or a convolutional layer of a convolutional neural network; the long short-term memory layer can be unidirectional or bidirectional.
  • the bidirectional long short-term memory layer has a better encoding effect on hot words than the unidirectional long short-term memory layer.
  • for a hot word of three characters (…, "Nai", "Chuan"), encoding with a hot word encoder consisting of one bidirectional long short-term memory layer is shown in FIG. 1C. The left side of the figure is the forward part of the bidirectional long short-term memory layer and the right side is the reverse part. The output vectors h_f^z and h_b^z of the last step of the forward and reverse parts are spliced, and the resulting vector h^z is the encoding vector representation of the hot word. The encoding vector representations of multiple hot words form the second feature vector sequence.
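  • The splicing of the last forward and last reverse states can be sketched as follows. This is a minimal illustration only: it substitutes a plain tanh recurrent layer for the long short-term memory layer, and the names `rnn_last_state`, `encode_hot_word` and `make_params`, the random weights and the dimensions are all assumptions, not part of the application.

```python
import numpy as np

def rnn_last_state(inputs, w_in, w_rec):
    """Run a plain tanh recurrent layer over inputs (T, d_in); return the final state."""
    state = np.zeros(w_rec.shape[0])
    for x in inputs:
        state = np.tanh(w_in @ x + w_rec @ state)
    return state

def encode_hot_word(char_embeddings, params):
    """Splice the last forward state h_f and the last reverse state h_b into h."""
    h_f = rnn_last_state(char_embeddings, params["w_in_f"], params["w_rec_f"])
    h_b = rnn_last_state(char_embeddings[::-1], params["w_in_b"], params["w_rec_b"])
    return np.concatenate([h_f, h_b])

def make_params(d_in, d_hidden, seed=0):
    """Hypothetical random weights standing in for trained encoder parameters."""
    rng = np.random.default_rng(seed)
    return {
        "w_in_f": rng.normal(scale=0.1, size=(d_hidden, d_in)),
        "w_rec_f": rng.normal(scale=0.1, size=(d_hidden, d_hidden)),
        "w_in_b": rng.normal(scale=0.1, size=(d_hidden, d_in)),
        "w_rec_b": rng.normal(scale=0.1, size=(d_hidden, d_hidden)),
    }

params = make_params(d_in=8, d_hidden=4)
rng = np.random.default_rng(1)
two_chars = rng.normal(size=(2, 8))    # embeddings of a two-character hot word
three_chars = rng.normal(size=(3, 8))  # embeddings of a three-character hot word

# Hot words of different lengths map to vectors of the same dimension.
second_sequence = np.stack(
    [encode_hot_word(word, params) for word in (two_chars, three_chars)]
)
```

  • In this sketch each hot word becomes one row of `second_sequence`, regardless of how many characters it contains.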
  • the electronic device can encode the audio segment of each hot word in the preset hot word database through the hot word audio encoder module to obtain the third feature vector sequence.
  • the third feature vector sequence can represent the audio information contained in the hot word audio segment.
  • the hot word audio encoder module and the above audio encoder module can be shared, that is, the two can share algorithms; for example, both can be the same encoder. The hot word audio encoder module can also include one or more encoding layers. An encoding layer can be a long short-term memory layer of a long short-term memory neural network or a convolutional layer of a convolutional neural network, and the long short-term memory layer can be unidirectional or bidirectional.
  • the hot word audio encoder module and the above-mentioned audio encoder module may also be two independent encoders, which are not limited in the present invention.
  • the audio clips of hot words can be obtained in ways that include but are not limited to: cutting them from existing audio, collecting them manually, synthesizing them with a speech synthesis system, etc., which are not limited here.
  • the audio segment of the hot word may be pre-stored, or may be an audio segment synthesized based on the hot word.
  • the last frame can represent the information of the entire audio sequence.
  • the last frame may not be taken, for example, the average of all frames may be taken.
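  • The two pooling choices mentioned above (last frame or average of all frames) can be illustrated with a short sketch. The function name `pool_frames` and the `mode` argument are hypothetical; the input is assumed to be the per-frame encoder output for one hot word audio clip.

```python
import numpy as np

def pool_frames(encoded_frames, mode="last"):
    """Reduce a (num_frames, dim) encoder output to one fixed-dimensional vector."""
    encoded_frames = np.asarray(encoded_frames, dtype=float)
    if mode == "last":
        # The last frame is taken to summarize the whole audio sequence.
        return encoded_frames[-1]
    if mode == "mean":
        # Alternative: average the encoder outputs over all frames.
        return encoded_frames.mean(axis=0)
    raise ValueError(f"unknown pooling mode: {mode}")
```

  • Either choice yields a vector whose dimension does not depend on the clip length, which is what allows clips of different durations to sit in the same third feature vector sequence.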
  • z_0 is a special hot word "<no-bias>", which means that there is no hot word. In a specific implementation, it can be replaced by the average value of all hot word vectors, where the hot word vectors can be all vectors of at least one of the second feature vector sequence and the third feature vector sequence.
  • "<no-bias>" is introduced to deal with cases where no hot words exist in the speech or the speech segment being recognized is not a hot word.
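  • One way to picture the "<no-bias>" entry is as an extra row placed at index 0 of a hot word vector sequence and filled with the average of the other rows. The sketch below assumes exactly that layout; the function name `add_no_bias_entry` is hypothetical.

```python
import numpy as np

def add_no_bias_entry(hot_word_vectors):
    """Prepend a no-bias row (the mean of all hot word vectors) at index 0."""
    hot_word_vectors = np.asarray(hot_word_vectors, dtype=float)
    no_bias = hot_word_vectors.mean(axis=0, keepdims=True)
    return np.concatenate([no_bias, hot_word_vectors], axis=0)
```

  • With this layout, an attention operation always has a candidate to attend to when none of the real hot words matches.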
  • the electronic device can perform the first attention operation on the first feature vector sequence and the third feature vector sequence through the frame-level attention module, fusing the features of the two to obtain the fourth feature vector sequence, which can significantly improve the hot word excitation effect.
  • the function of the frame-level attention module is to fuse the text information of the hot words in the preset hot word database into the output of the audio encoder module for each frame, forming a fourth feature vector sequence that carries a representation of the hot word information, so that the audio representation of each frame of the speech data to be recognized (the first feature vector sequence) is more robust to hot words.
  • a first attention operation is performed on the first feature vector sequence and the third feature vector sequence to obtain a fourth feature vector sequence, including:
  • the electronic device can calculate, based on the attention mechanism, a matching coefficient between the query item and each feature vector in the third feature vector sequence. For example, the coefficient is calculated by the inner product method or the feature distance method and then normalized to obtain the matching coefficient W_n, and W_n is then combined with the corresponding feature vector h_n^z.
  • the combination can be any of the following: dot multiplication and summation, weighted operation, inner product, etc., which are not limited here. After the operation, a new feature vector h_i^z is obtained, which is the feature that best matches the query item.
  • the role of this frame-level attention module is to add, during the audio encoding process, content containing the audio information of the hot words in the preset hot word library, which is more conducive to the decoding accuracy of hot words in the subsequent decoding module.
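  • The frame-level attention described above can be sketched with inner-product matching and softmax normalization. Several points are assumptions made only for illustration: each audio frame is used as the query item, both sequences share the same dimension, and the matched hot word vector is fused with the frame by concatenation (the application mentions feature splicing in FIG. 1D but this chunk does not spell out the fusion). The names `softmax` and `frame_level_attention` are hypothetical.

```python
import numpy as np

def softmax(scores, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def frame_level_attention(audio_frames, hot_word_audio):
    """audio_frames: (K, d) first sequence; hot_word_audio: (N, d) third sequence.

    Returns the fused sequence (K, 2d) and the matching coefficients (K, N).
    """
    # Inner product between every frame (query) and every hot word audio vector.
    scores = audio_frames @ hot_word_audio.T
    # Normalize per frame so the coefficients over hot words sum to 1.
    weights = softmax(scores, axis=1)
    # Weighted sum of hot word audio vectors: the best match for each frame.
    matched = weights @ hot_word_audio
    # Assumed fusion: splice the frame with its matched hot word vector.
    fused = np.concatenate([audio_frames, matched], axis=1)
    return fused, weights
```

  • A feature distance could replace the inner product in `scores` without changing the rest of the sketch.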
  • the electronic device may input the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence to the decoder for the decoding operation to obtain the recognition result. Alternatively, the electronic device may first perform the second attention operation on the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence, and then input the result to the decoder for the decoding operation to obtain the recognition result.
  • a decoder can contain multiple neural network layers.
  • the decoding operation can use beam search decoding or another decoding method, which will not be repeated here.
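  • For readers unfamiliar with beam search, a generic sketch follows. It is not the decoder of the application: `TOY_MODEL` and `toy_next_log_probs` are hypothetical stand-ins for a decoder that returns token probabilities given the prefix decoded so far, and the sketch runs for a fixed number of steps instead of stopping at a sentence end symbol.

```python
import math

def beam_search(next_log_probs, num_steps, beam_width):
    """Keep the beam_width best partial hypotheses at every decoding step."""
    beams = [((), 0.0)]  # (token prefix, accumulated log probability)
    for _ in range(num_steps):
        candidates = []
        for prefix, score in beams:
            for token, log_prob in next_log_probs(prefix).items():
                candidates.append((prefix + (token,), score + log_prob))
        # Sort by accumulated score and keep only the best beam_width prefixes.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:beam_width]
    return beams

# Hypothetical stand-in for the decoder: token probabilities given the prefix.
TOY_MODEL = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.5, "y": 0.5},
    ("b",): {"x": 0.9, "y": 0.1},
}

def toy_next_log_probs(prefix):
    return {token: math.log(p) for token, p in TOY_MODEL[prefix].items()}

best_prefix, best_score = beam_search(toy_next_log_probs, num_steps=2, beam_width=2)[0]
```

  • With `beam_width=2` the toy run keeps "b" alive after the first step and ends with the prefix ("b", "x"); with a width of 1 the search commits to "a" at the first step and cannot recover.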
  • a large amount of voice data with text annotations can be collected, and its voice features can be extracted, and the voice features can be at least one of the following: PLP, MFCC, FilterBank, etc., which are not limited here.
  • the text-annotated speech data collected here can be used to train the Hot Word Audio Encoder module.
  • the speech feature sequence and text annotation sequence of a certain speech data can be expressed as follows:
  • Speech feature sequence X = [x_1, x_2, ..., x_k, ..., x_K]
  • Text annotation sequence Y = [y_0, y_1, ..., y_t, ..., y_T]
  • x_k represents the k-th frame speech feature vector in the speech feature sequence X
  • K is the total number of speech frames
  • y_t represents the t-th character in the text annotation sequence Y
  • T+1 is the total number of characters in the text annotation
  • y_0 is the sentence start symbol "<s>"
  • y_T is the sentence end symbol "</s>".
  • the speech recognition model can have the ability to support arbitrary hot word recognition, which means that hot words cannot be limited in model training. Therefore, in this embodiment of the present application, annotated segments can be randomly selected as hot words from the text annotations of the training data to participate in the entire model training.
  • B is an integer greater than 1, and will be described in detail.
  • P and N can be set, where P is the probability of whether to select a hot word in the training data of a certain sentence, and N is the maximum number of words of the selected hot word.
  • a comparison of the annotation before and after hot words are selected from a sentence is shown in the following table:
  • the special tag "<bias>" can be added after it.
  • the role of "<bias>" is to introduce training errors that force the model parameters related to hot words to be updated during model training, such as the model parameters of the hot word audio encoder module or the model parameters of the hot word text encoder module.
  • after "dong" and "jing" are selected as hot words, they can be added to the hot word list of this model update as the input of the hot word audio encoder module or the hot word text encoder module.
  • the hot word selection work is performed independently for each model update, and the hot word list can be empty at the initial moment. After processing the data, the model parameters can be updated using neural network optimization methods.
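  • The random hot word selection can be sketched as below. The sketch assumes the annotation is a list of characters without the sentence start and end symbols, treats the selected hot word as one contiguous span, and places "<bias>" directly after that span; the function name `select_hot_word` and these details are illustrative assumptions, with `p` and `max_len` playing the roles of P and N.

```python
import random

BIAS_TAG = "<bias>"

def select_hot_word(tokens, p, max_len, rng):
    """With probability p, pick a random span of at most max_len tokens as a hot word.

    Returns (annotation with BIAS_TAG inserted after the span, hot word or None).
    """
    if not tokens or rng.random() >= p:
        return list(tokens), None  # no hot word chosen for this sentence
    length = rng.randint(1, min(max_len, len(tokens)))
    start = rng.randint(0, len(tokens) - length)
    end = start + length
    hot_word = tuple(tokens[start:end])
    labeled = list(tokens[:end]) + [BIAS_TAG] + list(tokens[end:])
    return labeled, hot_word
```

  • Calling this once per sentence for every model update, and collecting the returned hot words into that update's hot word list, mirrors the independent per-update selection described above.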
  • the sample data and the real recognition result corresponding to the sample data are obtained. The sample data is encoded to obtain the first feature vector sequence; each hot word in the preset hot word database is encoded to obtain the second feature vector sequence; the audio segment of each hot word in the preset hot word library is encoded to obtain the third feature vector sequence; the first attention operation is performed on the first feature vector sequence and the third feature vector sequence to obtain the fourth feature vector sequence; the decoding operation is performed according to the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence to obtain the predicted recognition result; and the model parameters are updated according to the deviation between the real recognition result and the predicted recognition result.
  • the decoding operation is performed according to the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence to obtain a recognition result, which may include the following steps:
  • A51 Perform a second attention operation on the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence to obtain a hot word text context feature vector sequence, a hot word audio context feature vector sequence and audio context feature vector sequence;
  • A52 Input the hot word text context feature vector sequence, the hot word audio context feature vector sequence, and the audio context feature vector sequence to a decoder to perform a decoding operation to obtain a recognition result.
  • the electronic device can perform the second attention operation on the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence to obtain the hot word text context feature vector sequence, the hot word audio context feature vector sequence and the audio context feature vector sequence. The function of the word-level attention module is to extract, at each decoding moment, the audio-related feature vector and the hot-word-related feature vectors required at that moment from the audio feature vector sequence, the hot word text feature vector sequence and the hot word audio feature vector sequence.
  • the audio-related feature vector represents the audio content of the character to be decoded at the t-th moment
  • the hot word text-related feature vector represents the possible hot word text content at the t-th moment
  • the hot word audio-related feature vector represents the possible hot word audio content at the t-th moment.
  • the attention mechanism can use a vector as the query item (query), perform the attention mechanism operation on a feature vector sequence, and select the feature vector that best matches the query item as the output. Specifically, a matching coefficient is calculated between the query item and each feature vector in the feature vector sequence, and these matching coefficients are then multiplied with the corresponding feature vectors and summed to obtain a new feature vector, which is the feature vector that best matches the query item.
  • the second attention operation is performed on the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence to obtain a hot word text context feature vector sequence, a hot word audio context feature vector sequence and an audio context feature vector sequence, which can include the following steps:
  • A511. Obtain the first state feature vector of the decoder at the first historical moment;
  • A512. Perform an attention operation on the second feature vector sequence according to the first state feature vector to obtain the hot word text context feature vector sequence at the current moment;
  • A513. Perform an attention operation on the third feature vector sequence according to the first state feature vector to obtain the hot word audio context feature vector sequence at the current moment;
  • A514. Perform an attention operation on the fourth feature vector sequence according to the first state feature vector to obtain the audio context feature vector sequence at the current moment.
  • the audio context feature vector sequence c_t^x can be obtained. Since the hot words participate in its calculation, it contains the complete audio information of the potential hot words, so c_t^x calculated in this way also carries information about whether a hot word is included and which hot word it is.
  • using d_t as the query item, the attention mechanism operation is performed on the second feature vector sequence H^z output by the hot word text encoder module to obtain the hot word text context feature vector sequence c_t^z; similarly, using d_t as the query item, the attention mechanism operation is performed on the third feature vector sequence H^w output by the hot word audio encoder module to obtain the hot word audio context feature vector sequence c_t^w.
  • these three vectors can be spliced together and sent to the decoder module for decoding at the t-th moment.
  • c_t^w, which carries the audio information corresponding to the hot words in the library, is more conducive to the decoding accuracy of subsequent hot words.
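  • The word-level attention step can be sketched as three attention operations that share the decoder state as the query item, followed by splicing. The sketch assumes inner-product matching with softmax normalization and that all three sequences have the same dimension as the query; the function names `attend` and `word_level_attention` are hypothetical.

```python
import numpy as np

def attend(query, sequence):
    """Softmax-weighted sum of the rows of sequence (N, d), scored against query (d,)."""
    scores = sequence @ query
    scores = scores - scores.max()  # shift for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ sequence

def word_level_attention(d_t, h_audio, h_text, h_hot_audio):
    """Compute c_t^x, c_t^w and c_t^z with d_t as the query and splice them."""
    c_x = attend(d_t, h_audio)      # audio context from the fourth sequence
    c_z = attend(d_t, h_text)       # hot word text context from H^z
    c_w = attend(d_t, h_hot_audio)  # hot word audio context from H^w
    return np.concatenate([c_x, c_w, c_z])
```

  • The spliced vector returned here corresponds to what is sent to the decoder module at the t-th moment in this simplified picture.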
  • the second attention operation may also be performed on the second feature vector sequence, the third feature vector sequence, and the fourth feature vector sequence based on the first feature vector sequence, so as to obtain the hot word text context feature vector sequence, the hot word audio context feature vector sequence and the audio context feature vector sequence.
  • the decoder includes a first-layer unidirectional long short-term memory layer
  • the above step A511, obtaining the first state feature vector of the decoder at the first historical moment may include the following steps:
  • A5111. Obtain the recognition result of the first historical moment and the hot word text context feature vector sequence, the hot word audio context feature vector sequence and the audio context feature vector sequence of the first historical moment;
  • A5112. Input the hot word text context feature vector sequence, the hot word audio context feature vector sequence and the audio context feature vector sequence of the first historical moment into the first layer of one-way long short-term memory layer to obtain the first state feature vector.
  • the above-mentioned first historical moment is at least one moment before the current moment, that is, the first historical moment may be a moment before the current moment, or may also be a plurality of moments before the current moment;
  • the above-mentioned decoder may include two one-way long short-term memory layers, namely a first one-way long short-term memory layer and a second one-way long short-term memory layer.
  • the recognition result of the decoder at the first historical moment and the hot word text context feature vector sequence, hot word audio context feature vector sequence and audio context feature vector sequence of the first historical moment are input to the first one-way long short-term memory layer to obtain the first state feature vector. Using the recognition result of the first historical moment together with the corresponding input content for memory (feature) fusion helps to improve the prediction ability of the model.
  • the hot word text context feature vector sequence, hot word audio context feature vector sequence and audio context feature vector sequence at the first historical moment can be obtained as follows: obtain the first state feature vector of the decoder at the first historical moment; perform an attention operation on the second feature vector sequence according to the first state feature vector to obtain the hot word text context feature vector sequence at the first historical moment; perform an attention operation on the third feature vector sequence according to the first state feature vector to obtain the hot word audio context feature vector sequence at the first historical moment; and perform an attention operation on the fourth feature vector sequence according to the first state feature vector to obtain the audio context feature vector sequence at the first historical moment.
  • this $d_{t-1}$ can be used as the query for the attention operation on the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence input at the first historical moment; $d_{t-1}$ itself can be obtained by inputting the recognition result of the second historical moment and the hot word text context feature vector sequence, hot word audio context feature vector sequence and audio context feature vector sequence of the second historical moment into the first unidirectional long short-term memory layer, which yields the first state feature vector.
  • the second historical moment may be at least one moment before the first historical moment, that is, the second historical moment may be a moment before the first historical moment, or may be multiple moments before the first historical moment.
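The query-based attention described above can be illustrated with a minimal sketch. The text does not specify the scoring function, so dot-product scoring followed by a softmax is assumed here; the names `attend` and `word_level_attention` are hypothetical, and all vectors are assumed to share one dimension for simplicity.

```python
import math

def attend(query, keys):
    """Dot-product attention (an assumed scoring rule): softmax-weighted sum of `keys`."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(keys[0])
    return [sum(w * key[i] for w, key in zip(weights, keys)) for i in range(dim)]

def word_level_attention(state, hot_text, hot_audio, fused_audio):
    """Use the decoder state as the query over the second, third and fourth sequences."""
    return (
        attend(state, hot_text),     # hot word text context
        attend(state, hot_audio),    # hot word audio context
        attend(state, fused_audio),  # audio context
    )
```

With an all-zero state the weights are uniform, so each context vector is simply the mean of the corresponding sequence; a trained state would instead concentrate the weights on the best-matching hot word or frame.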
  • alternatively, all or part of the recognition results before the current moment, together with the hot word text context feature vector sequence, hot word audio context feature vector sequence and audio context feature vector sequence input to the decoder at the first historical moment, can be input into the first unidirectional long short-term memory layer to obtain the first state feature vector; then, the hot word text context feature vector sequence, hot word audio context feature vector sequence and audio context feature vector sequence input to the decoder at the current moment are input to the second unidirectional long short-term memory layer to obtain the recognition result at the current moment.
  • the decoder includes a second unidirectional long short-term memory layer, and the above step A52, inputting the hot word text context feature vector sequence, the hot word audio context feature vector sequence and the audio context feature vector sequence to the decoder for a decoding operation to obtain the recognition result, may include the following steps:
  • input the hot word text context feature vector sequence, the hot word audio context feature vector sequence and the audio context feature vector sequence that are input to the decoder at the current moment into the second unidirectional long short-term memory layer to obtain the recognition result of the current moment, wherein the hot word text context feature vector sequence, the hot word audio context feature vector sequence and the audio context feature vector sequence of the current moment are obtained by performing the second attention operation, using the first state feature vector, on the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence of the current moment, respectively.
  • the current moment can be understood as the current decoding moment.
  • the first historical moment is a moment before the current moment; for example, when the current moment is the moment of decoding the t-th character, the previous moment, that is, the moment of decoding the (t-1)-th character, is the first historical moment.
  • the decoder can include two unidirectional long short-term memory layers. Taking the t-th character (moment) as an example, when decoding the t-th character, the first long short-term memory layer takes the recognition result character $y_{t-1}$ of time t-1, together with the word-level attention output of time t-1, as input and computes its output $d_t$.
  • $d_t$ is input to the word-level attention module and is used to calculate the output $c_t$ of the word-level attention module at time t.
  • $c_t$ consists of the hot word text context feature vector sequence, the hot word audio context feature vector sequence and the audio context feature vector sequence; $c_t$ is then used as the input of the second long short-term memory layer to calculate the decoder output $h_t^d$, and finally the posterior probability of the output character is calculated to obtain the recognition result.
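The data flow of one decoding step can be sketched as follows. This is only a wiring illustration under stated assumptions: `decode_step` and its four callables are hypothetical stand-ins for the two long short-term memory layers, the word-level attention module and the posterior calculation; recurrent state is assumed to live inside the callables; and picking the most probable character (greedy selection) is an assumption, since the text does not fix a search strategy.

```python
def decode_step(prev_char, prev_context, layer1, attention, layer2, posterior):
    """One decoding step t.

    Hypothetical callables supplied by the caller:
      layer1(prev_char, prev_context) -> d_t  (output of the first LSTM layer)
      attention(d_t)                  -> c_t  (word-level attention output)
      layer2(c_t)                     -> h_t  (output of the second LSTM layer)
      posterior(h_t)                  -> dict mapping character -> probability
    """
    d_t = layer1(prev_char, prev_context)  # fuse the previous result with the previous context
    c_t = attention(d_t)                   # d_t is the query of the word-level attention
    h_t = layer2(c_t)                      # second layer consumes the context features
    probs = posterior(h_t)                 # posterior probability of each output character
    best = max(probs, key=probs.get)       # greedy choice (an assumption of this sketch)
    return best, c_t
```

The returned `c_t` is handed back so that the caller can pass it in as `prev_context` when decoding character t+1.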
  • inputting both the hot word text and its speech segments effectively enriches the hot word input information, which can greatly improve the hot word excitation effect;
  • the use of two-level excitation, that is, the two-stage attention operation, likewise improves the hot word excitation effect.
  • the two together improve the hot word recognition effect, which in turn helps to improve the hot word recognition accuracy.
  • the decoding operation is performed according to the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence to obtain a recognition result, which can be implemented as follows:
  • the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence are input to a decoder for decoding operation to obtain the recognition result.
  • the electronic device can directly input the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence to the decoder for a decoding operation to obtain the recognition result. Because not only the hot word text information but also its corresponding audio segments are used as input, and the speech data to be recognized and the hot word audio segments are fused through an attention operation and then used as input, the hot word excitation effect can be significantly improved; decoding the three sequences then improves the hot word recognition effect and thereby the hot word recognition accuracy.
  • the decoder includes two unidirectional long short-term memory layers, namely a first unidirectional long short-term memory layer and a second unidirectional long short-term memory layer.
  • the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence are input into the decoder for decoding operation to obtain the recognition result, which may include the following steps :
  • the decoder may include multiple neural network layers.
  • the decoder may include two unidirectional long short-term memory layers, namely a first unidirectional long short-term memory layer and a second unidirectional long short-term memory layer.
  • the electronic device can obtain the recognition result of the first historical moment and the second feature vector sequence, third feature vector sequence and fourth feature vector sequence input to the decoder at the first historical moment, and input them into the first unidirectional long short-term memory layer to obtain the second state feature vector; it then inputs the second feature vector sequence, third feature vector sequence and fourth feature vector sequence that are input to the decoder at the current moment into the second unidirectional long short-term memory layer to obtain the recognition result at the current moment. The fourth feature vector sequence may be obtained by performing the first attention operation, using the second state feature vector, on at least one of the first feature vector sequence and the third feature vector sequence at the current moment; for example, the first attention operation can be performed on the first feature vector sequence and the third feature vector sequence at the current moment respectively, using the second state feature vector.
  • the output of the second unidirectional long short-term memory layer of the decoder can be obtained, and a posterior probability calculation can then be performed on this output to obtain the final decoding result, that is, the recognition result.
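The posterior probability calculation can be illustrated as follows. The sketch assumes that the decoder output has already been projected to one score (logit) per vocabulary entry and that a softmax turns these scores into posteriors; the text names neither step explicitly, and `posterior`, `pick_character` and the toy vocabulary are hypothetical.

```python
import math

def posterior(logits):
    """Softmax over per-character scores (assumed form of the posterior calculation)."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def pick_character(logits, vocabulary):
    """Return the most probable character and its posterior probability."""
    probs = posterior(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return vocabulary[best], probs[best]
```

For example, `pick_character([0.0, 0.0, 5.0], ["a", "b", "c"])` selects `"c"`, the entry with the highest score.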
  • in the speech recognition method described above, the speech data to be recognized is encoded to obtain a first feature vector sequence; each hot word in the preset hot word library is encoded to obtain a second feature vector sequence; the audio segment of each hot word in the preset hot word library is encoded to obtain a third feature vector sequence; the first attention operation is performed on the first feature vector sequence and the third feature vector sequence to obtain a fourth feature vector sequence; and the decoding operation is performed according to the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence to obtain the recognition result. Because not only the hot word text information but also its corresponding audio segments are used as input, and the speech data to be recognized and the hot word audio segments are fused through the attention operation and then used as input, the hot word excitation effect can be significantly improved; decoding the three sequences then improves the hot word recognition effect and thereby the hot word recognition accuracy.
  • FIG. 2 is a schematic flowchart of a speech recognition method provided by an embodiment of the present application. As shown in the figure, the speech recognition method shown in FIG. 2 is applied to the speech recognition model shown in FIG. 1A, the speech recognition model is applied to an electronic device, and the speech recognition method includes:
  • inputting the hot word text combined with its speech segments effectively enriches the hot word input information, which can greatly improve the hot word excitation effect;
  • the use of two-level excitation, that is, the two-stage attention operation, likewise improves the hot word excitation effect.
  • the two levels of the hot word excitation scheme complement each other and jointly improve the hot word recognition effect, which in turn helps to improve the hot word recognition accuracy.
  • FIG. 3 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • the electronic device includes a processor, a memory, a communication interface, and one or more programs, wherein the above-mentioned one or more programs are stored in the above-mentioned memory and are configured to be executed by the above-mentioned processor.
  • the above-mentioned program includes instructions for executing the following steps:
  • a decoding operation is performed according to the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence to obtain a recognition result.
  • the electronic device described in the embodiment of the present application encodes the speech data to be recognized to obtain the first feature vector sequence; encodes each hot word in the preset hot word library to obtain the second feature vector sequence; encodes the audio segment of each hot word in the preset hot word library to obtain the third feature vector sequence; performs the first attention operation on the first feature vector sequence and the third feature vector sequence to obtain the fourth feature vector sequence; and performs the decoding operation according to the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence to obtain the recognition result. Because not only the hot word text information but also its corresponding audio segments are used as input, and the speech data to be recognized and the hot word audio segments are fused through the attention operation and then used as input, the hot word excitation effect can be significantly improved; decoding the three sequences then improves the hot word recognition effect and thereby the hot word recognition accuracy.
  • the above program includes instructions for executing the following steps:
  • the matching coefficient corresponding to each third feature vector is combined with the corresponding third feature vector to obtain a new feature vector corresponding to each third feature vector;
  • the new feature vector corresponding to each third feature vector is spliced with the corresponding first feature vector to obtain the characterization vector corresponding to each third feature vector, and the characterization vectors corresponding to all the third feature vectors are taken as the fourth feature vector sequence.
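The matching, weighting and splicing steps can be sketched as follows. The text leaves the matching function and the "operation" between coefficient and vector open, so this sketch assumes a dot-product match normalised by a softmax, a coefficient-weighted sum of the hot word audio vectors for each frame, and concatenation as the splice; `frame_level_attention` is a hypothetical name.

```python
import math

def frame_level_attention(audio_frames, hot_audio):
    """Fuse each audio frame with the hot word audio vectors (illustrative assumptions)."""
    fused = []
    dim = len(hot_audio[0])
    for frame in audio_frames:
        # Matching coefficient of this frame against every hot word audio vector.
        scores = [sum(f * h for f, h in zip(frame, hot)) for hot in hot_audio]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        coeffs = [e / total for e in exps]
        # Combine coefficients with the hot word vectors (weighted sum assumed).
        summary = [sum(c * hot[i] for c, hot in zip(coeffs, hot_audio)) for i in range(dim)]
        # Splice the result onto the frame's own feature vector.
        fused.append(list(frame) + summary)
    return fused
```

Each output vector is the original frame vector followed by its hot word summary, so the fused sequence keeps one entry per frame while its dimension grows by the dimension of the hot word audio vectors.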
  • the above program includes instructions for executing the following steps:
  • in terms of performing the second attention operation on the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence to obtain the hot word text context feature vector sequence, the hot word audio context feature vector sequence and the audio context feature vector sequence, the above program includes instructions for performing the following steps:
  • the decoder includes a first unidirectional long short-term memory layer, and in terms of acquiring the first state feature vector of the decoder at the first historical moment, the above program includes instructions for performing the following steps:
  • the hot word audio context feature vector sequence and the audio context feature vector sequence of the first historical moment are input to the first unidirectional long short-term memory layer to obtain the first state feature vector.
  • the decoder includes a second unidirectional long short-term memory layer, and in terms of inputting the hot word text context feature vector sequence, the hot word audio context feature vector sequence and the audio context feature vector sequence to the decoder for a decoding operation to obtain the recognition result, the above program includes instructions for executing the following steps:
  • input the hot word text context feature vector sequence, the hot word audio context feature vector sequence and the audio context feature vector sequence that are input to the decoder at the current moment into the second unidirectional long short-term memory layer to obtain the recognition result of the current moment, wherein the hot word text context feature vector sequence, the hot word audio context feature vector sequence and the audio context feature vector sequence of the current moment are obtained by performing the second attention operation, using the first state feature vector, on the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence of the current moment, respectively.
  • the above program includes instructions for executing the following steps:
  • the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence are input to a decoder for decoding operation to obtain the recognition result.
  • the decoder includes two unidirectional long short-term memory layers, namely a first unidirectional long short-term memory layer and a second unidirectional long short-term memory layer; in terms of inputting the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence to the decoder for a decoding operation to obtain the recognition result, the above program includes instructions for executing the following steps:
  • input the recognition result of the first historical moment, together with the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence input to the decoder at the first historical moment, into the first unidirectional long short-term memory layer to obtain the second state feature vector;
  • the fourth feature vector sequence is obtained by performing the first attention operation, using the second state feature vector, on the first feature vector sequence and the third feature vector sequence at the current moment.
  • encoding the audio segment of each hot word in the preset hot word library to obtain a third feature vector sequence includes:
  • encoding the audio segment of each hot word in the preset hot word library with one or more encoding layers to obtain the third feature vector sequence;
  • the encoding layer includes a long short-term memory layer of a long short-term memory neural network or a convolutional layer of a convolutional neural network, where the long short-term memory layer is a long short-term memory layer of a unidirectional or bidirectional long short-term memory neural network.
  • the electronic device includes corresponding hardware structures and/or software modules for executing each function.
  • the present application can be implemented in hardware or in the form of a combination of hardware and computer software, in combination with the units and algorithm steps of each example described in the embodiments provided herein. Whether a function is performed by hardware or computer software driving hardware depends on the specific application and design constraints of the technical solution. Skilled artisans may implement the described functionality using different methods for each particular application, but such implementations should not be considered beyond the scope of this application.
  • the electronic device may be divided into functional units according to the foregoing method examples.
  • each functional unit may be divided corresponding to each function, or two or more functions may be integrated into one processing unit.
  • the above-mentioned integrated units may be implemented in the form of hardware, or may be implemented in the form of software functional units. It should be noted that the division of units in the embodiments of the present application is illustrative, and is only a logical function division, and other division methods may be used in actual implementation.
  • FIG. 4A is a block diagram of functional units of the speech recognition apparatus 400 involved in the embodiment of the present application.
  • the speech recognition apparatus 400 is applied to an electronic device, and the speech recognition apparatus 400 includes: an audio encoder module 401, a hot word text encoder module 402, a hot word audio encoder module 403, a frame-level attention module 404, and a decoder module 405, wherein,
  • the audio encoder module 401 is used to encode the speech data to be recognized to obtain a first feature vector sequence
  • the hot word text encoder module 402 is used to encode each hot word in the preset hot word database to obtain a second feature vector sequence
  • the hot word audio encoder module 403 is configured to encode the audio segment of each hot word in the preset hot word library to obtain a third feature vector sequence
  • the frame-level attention module 404 is configured to perform a first attention operation on the first feature vector sequence and the third feature vector sequence to obtain a fourth feature vector sequence;
  • the decoder module 405 is configured to perform a decoding operation according to the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence to obtain a recognition result.
  • the speech recognition apparatus described in the embodiment of the present application encodes the speech data to be recognized to obtain the first feature vector sequence; encodes each hot word in the preset hot word library to obtain the second feature vector sequence; encodes the audio segment of each hot word in the preset hot word library to obtain the third feature vector sequence; performs the first attention operation on the first feature vector sequence and the third feature vector sequence to obtain the fourth feature vector sequence; and performs the decoding operation according to the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence to obtain the recognition result. Because not only the hot word text information but also its corresponding audio segments are used as input, and the speech data to be recognized and the hot word audio segments are fused through the attention operation and then used as input, the hot word excitation effect can be significantly improved; decoding the three sequences then improves the hot word recognition effect and thereby the hot word recognition accuracy.
  • the frame-level attention module 404 is specifically configured to:
  • combine the matching coefficient corresponding to each third feature vector with the corresponding third feature vector to obtain a new feature vector corresponding to each third feature vector;
  • splice the new feature vector corresponding to each third feature vector with the corresponding first feature vector to obtain the characterization vector corresponding to each third feature vector, and take the characterization vectors corresponding to all the third feature vectors as the fourth feature vector sequence.
  • its decoder module 405 may include: a word-level attention module 4051 and decoder 4052, where,
  • the word-level attention module 4051 is configured to perform a second attention operation on the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence to obtain a hot word text context feature vector sequence, a hot word audio context feature vector sequence and an audio context feature vector sequence;
  • the decoder 4052 is configured to input the hot word text context feature vector sequence, the hot word audio context feature vector sequence and the audio context feature vector sequence to the decoder for decoding operation to obtain a recognition result.
  • in terms of performing the second attention operation on the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence to obtain the hot word text context feature vector sequence, the hot word audio context feature vector sequence and the audio context feature vector sequence, the word-level attention module 4051 is specifically configured to:
  • the decoder includes a first unidirectional long short-term memory layer, and in terms of acquiring the first state feature vector of the decoder at the first historical moment, the word-level attention module 4051 is specifically configured to:
  • input the hot word audio context feature vector sequence and the audio context feature vector sequence of the first historical moment to the first unidirectional long short-term memory layer to obtain the first state feature vector.
  • the decoder includes a second unidirectional long short-term memory layer, and in terms of inputting the hot word text context feature vector sequence, the hot word audio context feature vector sequence and the audio context feature vector sequence to the decoder for a decoding operation to obtain the recognition result, the decoder 4052 is specifically configured to:
  • input the hot word text context feature vector sequence, the hot word audio context feature vector sequence and the audio context feature vector sequence that are input to the decoder at the current moment into the second unidirectional long short-term memory layer to obtain the recognition result of the current moment, wherein the hot word text context feature vector sequence, the hot word audio context feature vector sequence and the audio context feature vector sequence of the current moment are obtained by performing the second attention operation, using the first state feature vector, on the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence of the current moment, respectively.
  • the decoder module 405 is specifically configured to:
  • the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence are input to a decoder for decoding operation to obtain the recognition result.
  • the decoder includes two unidirectional long short-term memory layers, namely a first unidirectional long short-term memory layer and a second unidirectional long short-term memory layer; in terms of inputting the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence to the decoder for a decoding operation to obtain the recognition result, the decoder module 405 is specifically configured to:
  • input the recognition result of the first historical moment, together with the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence input to the decoder at the first historical moment, into the first unidirectional long short-term memory layer to obtain the second state feature vector;
  • the fourth feature vector sequence is obtained by performing the first attention operation, using the second state feature vector, on the first feature vector sequence and the third feature vector sequence at the current moment.
  • the hot word audio encoder module 403 is specifically configured to:
  • encode the audio segment of each hot word in the preset hot word library with one or more encoding layers to obtain the third feature vector sequence;
  • the encoding layer includes a long short-term memory layer of a long short-term memory neural network or a convolutional layer of a convolutional neural network, where the long short-term memory layer is a long short-term memory layer of a unidirectional or bidirectional long short-term memory neural network.
  • Embodiments of the present application further provide a computer storage medium, wherein the computer storage medium stores a computer program for electronic data exchange, and the computer program causes a computer to execute part or all of the steps of any method described in the above method embodiments; the above computer includes an electronic device.
  • Embodiments of the present application further provide a computer program product, where the computer program product includes a non-transitory computer-readable storage medium storing a computer program, and the computer program is operable to cause a computer to execute some or all of the steps of any method described in the above method embodiments.
  • the computer program product may be a software installation package, and the computer includes an electronic device.
  • the disclosed apparatus may be implemented in other manners.
  • the device embodiments described above are only illustrative.
  • the division of the above-mentioned units is only a logical function division.
  • multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the shown or discussed mutual coupling or direct coupling or communication connection may be through some interfaces, indirect coupling or communication connection of devices or units, and may be in electrical or other forms.
  • the units described above as separate components may or may not be physically separated, and components shown as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units may be implemented in the form of hardware, or may be implemented in the form of software functional units.
  • the above-mentioned integrated units if implemented in the form of software functional units and sold or used as independent products, may be stored in a computer-readable memory.
  • the technical solution of the present application in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a memory and is executed by a computer device, which may be a personal computer, a server, or a network device, etc.
  • the aforementioned memory includes: U disk, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), mobile hard disk, magnetic disk or optical disk and other media that can store program codes.


Abstract

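A speech recognition method and apparatus, an electronic device, a storage medium and a program product. The speech recognition method comprises: encoding speech data to be recognized to obtain a first feature vector sequence (101); encoding each hot word in a preset hot word library to obtain a second feature vector sequence (102); encoding the audio segment of each hot word in the preset hot word library to obtain a third feature vector sequence (103); performing a first attention operation on the first feature vector sequence and the third feature vector sequence to obtain a fourth feature vector sequence (104); and performing a decoding operation according to the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence to obtain a recognition result (105). The method can improve hot word recognition accuracy.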

Description

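Speech recognition method, apparatus and storage medium
This application claims priority to the prior application No. 202011641751.3, filed on December 31, 2020 and entitled "Speech recognition method, apparatus and storage medium", the content of which is incorporated herein by reference.
Technical Field
The present application relates to the technical field of speech recognition, and in particular to a speech recognition method, apparatus and storage medium.
Background
In the field of speech recognition, end-to-end models assign very low scores to low-frequency words, so the traditional approach of boosting hot word scores brings only limited improvement. CLAS (Contextual Listen, Attend and Spell), proposed by Google, excites hot words at the model level and has achieved good results, but the approach is too simple: sentences that contain no hot words are easily misrecognized as containing them, which lowers the overall recognition rate and makes the approach difficult to use directly in a practical system. How to improve hot word recognition accuracy is therefore a problem that urgently needs to be solved.
Summary
Embodiments of the present application provide a speech recognition method, apparatus and storage medium capable of improving hot word recognition accuracy.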
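In a first aspect, an embodiment of the present application provides a speech recognition method, the method comprising:
encoding speech data to be recognized to obtain a first feature vector sequence;
encoding each hot word in a preset hot word library to obtain a second feature vector sequence;
encoding the audio segment of each hot word in the preset hot word library to obtain a third feature vector sequence;
performing a first attention operation on the first feature vector sequence and the third feature vector sequence to obtain a fourth feature vector sequence;
performing a decoding operation according to the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence to obtain a recognition result.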
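The five steps of the method can be read as a simple data flow. The sketch below only wires the steps together; `recognize` and every callable passed to it are hypothetical placeholders for the encoder, attention and decoder components, whose internals are not fixed by the steps themselves.

```python
def recognize(speech, hot_words, hot_word_clips,
              encode_audio, encode_hot_text, encode_hot_audio,
              frame_attention, decode):
    """Wire the five steps together; all callables are caller-supplied stand-ins."""
    first = encode_audio(speech)                                 # first feature vector sequence
    second = [encode_hot_text(word) for word in hot_words]       # second: one vector per hot word
    third = [encode_hot_audio(clip) for clip in hot_word_clips]  # third: one vector per audio segment
    fourth = frame_attention(first, third)                       # fourth: fused audio representation
    return decode(second, third, fourth)                         # recognition result
```

Keeping the components as parameters makes the dependency order explicit: the fourth sequence depends on the first and third, and the decoder sees only the second, third and fourth.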
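In a second aspect, an embodiment of the present application provides a speech recognition apparatus, the speech recognition apparatus comprising: an audio encoder module, a hot word text encoder module, a hot word audio encoder module, a frame-level attention module and a decoder module, wherein,
the audio encoder module is configured to encode speech data to be recognized to obtain a first feature vector sequence;
the hot word text encoder module is configured to encode each hot word in a preset hot word library to obtain a second feature vector sequence;
the hot word audio encoder module is configured to encode the audio segment of each hot word in the preset hot word library to obtain a third feature vector sequence;
the frame-level attention module is configured to perform a first attention operation on the first feature vector sequence and the third feature vector sequence to obtain a fourth feature vector sequence;
the decoder module is configured to perform a decoding operation according to the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence to obtain a recognition result.
In a third aspect, an embodiment of the present application provides an electronic device, comprising: a processor, a memory, a communication interface, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the processor, and the programs include instructions for executing the steps of any method of the first aspect of the embodiments of the present application.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, wherein the computer-readable storage medium stores a computer program for electronic data exchange, and the computer program causes a computer to execute some or all of the steps described in the first aspect of the embodiments of the present application.
In a fifth aspect, an embodiment of the present application provides a computer program product, wherein the computer program product includes a non-transitory computer-readable storage medium storing a computer program, and the computer program is operable to cause a computer to execute some or all of the steps described in the first aspect of the embodiments of the present application. The computer program product may be a software installation package.
Implementing the embodiments of the present application has the following beneficial effects:
It can be seen that the speech recognition method, apparatus and related products described in the embodiments of the present application encode the speech data to be recognized to obtain a first feature vector sequence; encode each hot word in the preset hot word library to obtain a second feature vector sequence; encode the audio segment of each hot word in the preset hot word library to obtain a third feature vector sequence; perform a first attention operation on the first feature vector sequence and the third feature vector sequence to obtain a fourth feature vector sequence; and perform a decoding operation according to the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence to obtain a recognition result. Because not only the hot word text information but also its corresponding audio segments are used as input, and the speech data to be recognized and the hot word audio segments are fused through the attention operation and then used as input, the hot word excitation effect can be significantly improved; decoding the three sequences then improves the hot word recognition effect and thereby the hot word recognition accuracy.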
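Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1A is a schematic structural diagram of a speech recognition model provided by an embodiment of the present application;
FIG. 1B is a schematic flowchart of a speech recognition method provided by an embodiment of the present application;
FIG. 1C is a schematic illustration of hot word encoding provided by an embodiment of the present application;
FIG. 1D is a schematic illustration of feature splicing provided by an embodiment of the present application;
FIG. 2 is a schematic flowchart of another speech recognition method provided by an embodiment of the present application;
FIG. 3 is a schematic structural diagram of an electronic device provided by an embodiment of the present application;
FIG. 4A is a block diagram of the functional units of a speech recognition apparatus provided by an embodiment of the present application;
FIG. 4B is a block diagram of the functional units of another speech recognition apparatus provided by an embodiment of the present application.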
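Detailed Description
To help a person skilled in the art better understand the solutions of this application, the technical solutions in the embodiments of this application are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of this application without creative effort fall within the protection scope of this application.
The terms "first", "second" and the like in the specification, claims and drawings of this application are used to distinguish different objects, not to describe a particular order. The terms "include" and "have" and any variants thereof are intended to cover non-exclusive inclusion. For example, a process, method, system, product or device that comprises a series of steps or units is not limited to the listed steps or units; in one possible example it also includes steps or units that are not listed, or other steps or units inherent to the process, method, product or device.
Reference herein to an "embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment can be included in at least one embodiment of this application. The appearances of this phrase in various places in the specification do not necessarily all refer to the same embodiment, nor to separate or alternative embodiments that are mutually exclusive of other embodiments. A person skilled in the art understands, explicitly and implicitly, that the embodiments described herein can be combined with other embodiments.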
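The electronic device in the embodiments of this application may be any of various devices with a speech recognition function: handheld devices, voice recorder pens, intelligent robots, smart readers, smart translators, smart earphones, smart dictionaries, smart point-reading machines, in-vehicle devices, wearable devices, computing devices or other processing devices connected to a wireless modem, as well as various forms of user equipment (UE), mobile stations (MS), terminal devices and so on. The electronic device may also be a server or a smart home device.
In the embodiments of this application, the smart home device may be at least one of the following: a refrigerator, washing machine, rice cooker, smart curtain, smart lamp, smart bed, smart trash can, microwave oven, oven, steamer, air conditioner, range hood, server, smart door, smart window, wardrobe, smart speaker, smart home system, smart chair, smart clothes-drying rack, smart shower, water dispenser, water purifier, air purifier, doorbell, monitoring system, smart garage, television, projector, smart dining table, smart sofa, massage chair, treadmill and so on.
The embodiments of this application are described in detail below.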
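Refer to Fig. 1A, which shows a speech recognition model provided by an embodiment of this application. The model comprises an audio encoder module, a hot word text encoder module, a hot word audio encoder module, a frame-level attention module, a word-level attention module and a decoder module; the decoder module may include a decoder. The model can be used to implement speech recognition, as follows:
First, the audio encoder module encodes the speech feature vector sequence $X=[x_1,x_2,\ldots,x_K]$ of the speech data to be recognized to obtain the first feature vector sequence $H^x=[h_1^x,h_2^x,\ldots,h_K^x]$, where $x_k$ is the speech feature vector of the $k$-th frame and $h_k^x$ is the feature vector output by the last neural network layer of the audio encoder module, i.e. the result of transforming $x_k$ through the audio encoder module. The electronic device may also use the hot word text encoder module to encode each hot word in the preset hot word library independently, so that hot words of different lengths are encoded into vectors of a fixed dimension. This yields a set of feature vectors representing the hot words, namely the second feature vector sequence $H^z=[h_0^z,h_1^z,\ldots,h_N^z]$, where $h_n^z$ is the feature vector of the $n$-th hot word after encoding by the hot word text encoder module.
The preset hot word library may be set in advance according to user needs. For example, a user may select hot words matching their identity or occupation from a basic hot word library, which may be built in advance for different identities or occupations. The preset hot word library may also be generated automatically from the user's history, for example from hot words that occur during use. As another example, in a voice assistant scenario, the names in the user's contact list may be read as hot words and used to generate the preset hot word library. As a further example, an application such as an input method may, with lawful authorization, remember entity text that the user types in pinyin, such as place names and personal names, and use it as hot words to generate the preset hot word library. The preset hot word library may be stored locally or in the cloud.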
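Next, the hot word audio encoder module may encode the audio segment of each hot word in the preset hot word library independently. The hot word audio encoder module may be shared with the audio encoder module described above, meaning that the two are the same encoder. Hot word audio segments of different lengths can thus also be encoded into vectors of a fixed dimension: the encoder output at the last frame of a hot word audio segment, or the average of the encoder outputs over all frames, may serve as the representation vector of the whole segment. This yields a set of vectors representing the hot word audio, namely the third feature vector sequence $H^w=[h_0^w,h_1^w,\ldots,h_N^w]$, where $h_n^w$ is the feature vector of the $n$-th hot word audio segment after encoding by the hot word audio encoder module. Then the frame-level attention module performs an attention operation, at the frame level, between the audio encoding representation of each frame (the first feature vector sequence) and the hot word audio encoding representations (the third feature vector sequence), fusing in the hot word information to form a new audio encoding representation, namely the fourth feature vector sequence (formula image PCTCN2021073773-appb-000001). Decoding can then be performed in either of two ways, as follows: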
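First, the word-level attention module takes as input the state vector $d_t$ output by the decoder module at time $t$, the fourth feature vector sequence output by the frame-level attention module (formula image PCTCN2021073773-appb-000002), the second feature vector sequence $H^z$ output by the hot word text encoder module, and the third feature vector sequence $H^w$ output by the hot word audio encoder module. Using an attention mechanism, it computes the audio context feature vector $c_t^x$, the hot word audio context feature vector $c_t^w$ and the hot word text context feature vector $c_t^z$ used to predict the $t$-th character, which are input to the decoder module to complete decoding.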
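Second, the decoder module may directly input the fourth feature vector sequence output by the frame-level attention module (formula image PCTCN2021073773-appb-000003), the second feature vector sequence $H^z$ output by the hot word text encoder module and the third feature vector sequence $H^w$ output by the hot word audio encoder module to the decoder to complete decoding.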
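In a specific implementation, the inputs comprise not only the hot word text information but also the corresponding audio segments, together with a representation that fuses the speech data to be recognized with the hot word audio segments through attention. This significantly strengthens hot word boosting, and decoding the three together improves hot word recognition and hence hot word recognition accuracy.
Further, refer to Fig. 1B, which is a schematic flowchart of a speech recognition method provided by an embodiment of this application. As shown, the method of Fig. 1B is applied to the speech recognition model of Fig. 1A, and the model is applied to an electronic device. The method comprises:
101. Encode the speech data to be recognized to obtain a first feature vector sequence.
In the embodiments of this application, the speech data to be recognized may be pre-stored or real-time captured speech data, or a speech feature vector sequence. The speech data may be at least one of the following: recorded audio, audio recorded in real time, audio extracted from video data, synthesized audio and so on, which is not limited here. The speech feature vector sequence may be at least one of the following: Filter Bank features, Mel Frequency Cepstrum Coefficient (MFCC) features, Perceptual Linear Predictive (PLP) features and so on, which is not limited here. For example, when the speech data to be recognized is speech data, the electronic device may extract features from it to obtain a speech feature vector sequence and then encode that sequence to obtain the first feature vector sequence. When the speech data to be recognized is already a speech feature vector sequence, the electronic device may encode it directly to obtain the first feature vector sequence.
In a specific implementation, as shown in Fig. 1B, the electronic device may encode the speech data to be recognized through the audio encoder module to obtain the first feature vector sequence. The audio encoder module may comprise one or more encoding layers. An encoding layer may be a long short-term memory (LSTM) layer of an LSTM network, which may be unidirectional or bidirectional, or a convolutional layer of a convolutional neural network. For example, in the embodiments of this application, three unidirectional LSTM layers may be used to encode the input speech feature vector sequence $X=[x_1,x_2,\ldots,x_K]$ and output the first feature vector sequence $H^x=[h_1^x,h_2^x,\ldots,h_K^x]$.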
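102. Encode each hot word in the preset hot word library to obtain a second feature vector sequence.
The preset hot word library may be stored in the electronic device in advance and may include the text information of multiple hot words. The electronic device may encode each hot word in the preset hot word library through the hot word text encoder module to obtain the second feature vector sequence. Alternatively, in other implementations, the preset hot word library may be stored in advance on another server and obtained by accessing that server.
In a specific implementation, different hot words may contain the same or different numbers of characters. For example, the Japanese hot word "東京" has two characters and "神奈川" has three. Such variable-length input can be represented by a vector of a fixed dimension so that the model can process it easily. The role of the hot word text encoder module is to encode hot words of different lengths into vectors of a fixed dimension. It may comprise one or more encoding layers; an encoding layer may be an LSTM layer of a unidirectional or bidirectional LSTM network, or a convolutional layer of a convolutional neural network.
In a specific implementation, a bidirectional LSTM layer encodes hot words better than a unidirectional one. Suppose one bidirectional LSTM layer is used and take the hot word "神奈川" as an example, which consists of the three characters "神", "奈" and "川". Fig. 1C illustrates how a hot word encoder with one bidirectional LSTM layer encodes it: the left side of the figure is the forward part of the bidirectional LSTM layer and the right side is the backward part. The output vectors $h_f^z$ and $h_b^z$ of the last step in the forward and backward directions are concatenated, and the resulting vector $h^z$ is the encoded vector representation of the hot word. The encoded representations of multiple hot words form the second feature vector sequence. Suppose there are $N+1$ hot words $Z=[z_0,z_1,\ldots,z_N]$. The hot word text encoder module processes each hot word independently to obtain the second feature vector sequence $H^z=[h_0^z,h_1^z,\ldots,h_N^z]$, where $h_i^z$ is the encoded vector of the $i$-th hot word $z_i$. Note that $z_0$ is a special hot word "<no-bias>" that means no hot word is present. When the hot word selected during later decoding is "<no-bias>", no hot word is boosted; this handles the cases where the speech contains no hot word or the speech segment currently being recognized is not a hot word.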
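The fixed-dimension encoding described above can be sketched as follows. This is a minimal illustration, not the patented encoder: the tanh recurrent cell stands in for a real LSTM cell, and `HIDDEN`, `embed`, `rnn_step`, `run` and `encode_hot_word` are hypothetical names and untrained toy functions introduced only for this example. What it shows is the shape of the idea: run forward and backward over the characters, then concatenate the two final states, so that every hot word (including the special "<no-bias>" entry) maps to a vector of the same size.

```python
import math

HIDDEN = 4  # hypothetical hidden size per direction


def embed(token):
    """Toy deterministic embedding (an assumption, not a trained table)."""
    code = sum(ord(c) for c in token)
    return [math.sin(code * (i + 1) * 0.001) for i in range(HIDDEN)]


def rnn_step(state, inp):
    """Stand-in recurrent cell; a real system would use an LSTM cell here."""
    return [math.tanh(0.5 * s + x) for s, x in zip(state, inp)]


def run(tokens):
    """Run the cell over a token sequence and return the final state."""
    state = [0.0] * HIDDEN
    for token in tokens:
        state = rnn_step(state, embed(token))
    return state


def encode_hot_word(tokens):
    """Concatenate final forward and backward states into one fixed-size vector."""
    h_f = run(tokens)        # forward direction
    h_b = run(tokens[::-1])  # backward direction
    return h_f + h_b


hot_words = ["<no-bias>", "東京", "神奈川"]
# "<no-bias>" is treated as one special token, the other hot words as characters.
H_z = [encode_hot_word([w] if w == "<no-bias>" else list(w)) for w in hot_words]
```

Whatever the length of the hot word, each entry of `H_z` has `2 * HIDDEN` components, which is the property the later attention steps rely on.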
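103. Encode the audio segment of each hot word in the preset hot word library to obtain a third feature vector sequence.
The electronic device may encode the audio segment of each hot word in the preset hot word library through the hot word audio encoder module to obtain the third feature vector sequence, which represents the audio information contained in the hot word audio segments.
In a specific implementation, the hot word audio encoder module may be shared with the audio encoder module described above, that is, the two may share the same algorithm, for example by being the same encoder. The hot word audio encoder module may likewise comprise one or more encoding layers; an encoding layer may be an LSTM layer of a unidirectional or bidirectional LSTM network, or a convolutional layer of a convolutional neural network. Alternatively, in other implementations, the hot word audio encoder module and the audio encoder module may be two independent encoders, which is not limited in this application.
The audio segment of a hot word may be obtained in ways including, but not limited to, cutting it from existing audio, recording it manually, or synthesizing it with a speech synthesis system. The audio segment may thus be pre-stored or synthesized from the hot word. For example, three unidirectional LSTM layers may be used to encode the feature vector sequence $X=[x_1,x_2,\ldots,x_K]$ of an input hot word audio segment, and the output at the last frame is taken as the representation vector of that audio segment. Because an LSTM is a recurrent neural network, the output at the last frame can represent the information of the whole audio sequence. In other implementations the last frame need not be used; for example, the average over all frames may be taken instead. Suppose there are $N+1$ hot words $Z=[z_0,z_1,\ldots,z_N]$. The hot word audio encoder module encodes the audio of each hot word independently to obtain a set of hot word audio vectors, namely the third feature vector sequence $H^w=[h_0^w,h_1^w,\ldots,h_N^w]$, where $h_i^w$ is the audio encoding vector of the $i$-th hot word $z_i$. Note that $z_0$ is the special hot word "<no-bias>", which means no hot word is present. In a specific implementation, its vector may be replaced by the average of all hot word vectors, where "all hot word vectors" may be all vectors of at least one of the second and third feature vector sequences. When the hot word selected during later decoding is "<no-bias>", no hot word is boosted; this handles the cases where the speech contains no hot word or the speech segment currently being recognized is not a hot word.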
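The two pooling choices and the averaged "<no-bias>" vector mentioned above can be illustrated with a small sketch. The frame values below are made-up numbers standing in for encoder outputs, and the function names are hypothetical.

```python
def last_frame(frames):
    """Use the encoder output at the last frame as the segment representation."""
    return list(frames[-1])


def mean_pool(frames):
    """Average the encoder outputs over all frames, dimension by dimension."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def no_bias_vector(hot_word_vectors):
    """Represent <no-bias> by the mean of all hot word vectors."""
    return mean_pool(hot_word_vectors)


# Hypothetical encoder outputs for two hot word audio segments of different lengths.
seg_a = [[1.0, 2.0], [3.0, 4.0]]
seg_b = [[0.0, 0.0], [2.0, 2.0], [4.0, 10.0]]

h_w = [mean_pool(seg_a), mean_pool(seg_b)]
H_w = [no_bias_vector(h_w)] + h_w  # index 0 is reserved for <no-bias>
```

Either pooling turns a segment of any number of frames into one vector of the encoder's output dimension, so `H_w` has one fixed-size entry per hot word plus the "<no-bias>" entry at index 0.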
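104. Perform a first attention operation on the first feature vector sequence and the third feature vector sequence to obtain a fourth feature vector sequence.
In a specific implementation, the electronic device may perform the first attention operation on the first and third feature vector sequences through the frame-level attention module, fusing the features of the two to obtain the fourth feature vector sequence. This can significantly strengthen hot word boosting.
The role of the frame-level attention module is to fuse the hot word audio information of the preset hot word library into each frame output by the audio encoder module, forming a fourth feature vector sequence that carries hot word information, so that the audio representation of each frame of the speech data to be recognized (the first feature vector sequence) is more robust with respect to hot words. Specifically, the attention mechanism of the first attention operation may use one frame vector $h_i^x$ output by the audio encoder module as the query and perform attention over the third feature vector sequence $H^w=[h_0^w,h_1^w,\ldots,h_N^w]$ output by the hot word audio encoder module.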
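In one possible example, step 104 above, performing the first attention operation on the first feature vector sequence and the third feature vector sequence to obtain the fourth feature vector sequence, comprises:
41. Perform a matching operation between each first feature vector in the first feature vector sequence and every third feature vector in the third feature vector sequence to obtain the matching coefficient corresponding to each third feature vector;
42. Combine the matching coefficient corresponding to each third feature vector with the corresponding third feature vector to obtain the new feature vector corresponding to each third feature vector;
43. Concatenate the new feature vector corresponding to each third feature vector with the corresponding first feature vector to obtain the representation vector corresponding to each third feature vector, and take these representation vectors as the fourth feature vector sequence.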
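Specifically, as shown in Fig. 1D, take any first feature vector in the first feature vector sequence as the query. The electronic device may compute, using the attention mechanism, a matching coefficient between the query and each feature vector in the third feature vector sequence, for example by an inner product or a feature distance, and normalize it to obtain the matching coefficient $w_n$. The matching coefficient $w_n$ is then combined with the corresponding feature vector $h_n^w$; the operation may be any of element-wise multiplication followed by summation, a weighted operation, an inner product and so on, which is not limited here. The result is a new feature vector, namely the feature vector that best matches the query, which is concatenated with the query to obtain the final fused audio encoding representation of that frame. Performing this operation on every frame vector output by the audio encoder yields the final fourth feature vector sequence (formula image PCTCN2021073773-appb-000004).
The main purpose of the frame-level attention module is to add the hot word audio information of the preset hot word library to the audio representation during encoding, which benefits the accuracy with which the subsequent decoder module decodes hot words.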
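The frame-level attention described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions: the matching coefficient is an inner product normalized with softmax (one of the options the text allows), the frame vectors and hot word audio vectors have the same dimension (as they would if the two encoders are shared), and all function names are hypothetical.

```python
import math


def softmax(scores):
    """Normalize raw matching scores into coefficients that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def frame_level_attention(H_x, H_w):
    """For each frame vector (query), attend over the hot word audio vectors
    and concatenate the attended vector with the query."""
    dim = len(H_w[0])
    fused = []
    for h_x in H_x:
        weights = softmax([dot(h_x, h_w) for h_w in H_w])
        context = [sum(w * h_w[d] for w, h_w in zip(weights, H_w)) for d in range(dim)]
        fused.append(list(h_x) + context)  # query followed by the best-matching hot word features
    return fused


# Hypothetical inputs: two frames, three hot word audio vectors (index 0 plays <no-bias>).
H_x = [[1.0, 0.0], [0.0, 1.0]]
H_w = [[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]]
H_fused = frame_level_attention(H_x, H_w)
```

Each fused frame has the query's dimension plus the hot word vector's dimension; in this toy input the first frame attends almost entirely to the second hot word vector and the second frame to the third.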
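105. Perform a decoding operation according to the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence to obtain a recognition result.
In a specific implementation, the electronic device may input the second, third and fourth feature vector sequences to the decoder for decoding to obtain the recognition result. Alternatively, the electronic device may first perform a second attention operation on the second, third and fourth feature vector sequences and then input the result to the decoder for decoding to obtain the recognition result. The decoder may comprise multiple neural network layers. The decoding operation may use Beam Search decoding or another decoding method, which is not described further here.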
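Since the text names Beam Search as one possible decoding method, a generic sketch may help. It is not the decoder of this application: `step_fn` is a hypothetical callback that returns next-token probabilities for a partial hypothesis, and the lookup table below is a toy stand-in for the neural decoder.

```python
import math


def beam_search(step_fn, start, end, beam_size=2, max_len=10):
    """Keep the `beam_size` best partial hypotheses by summed log probability."""
    beams = [([start], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for token, prob in step_fn(seq).items():
                candidates.append((seq + [token], score + math.log(prob)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            # Hypotheses that emitted the end symbol leave the beam.
            (finished if seq[-1] == end else beams).append((seq, score))
        if not beams:
            break
    finished.extend(beams)  # keep unfinished hypotheses if max_len was reached
    return max(finished, key=lambda c: c[1])[0]


# Toy next-token table (an assumption for illustration only).
table = {
    "<s>": {"東": 0.6, "京": 0.4},
    "東": {"京": 0.9, "</s>": 0.1},
    "京": {"</s>": 1.0},
}
best = beam_search(lambda seq: table[seq[-1]], "<s>", "</s>")
```

In a real system the callback would run one decoder step on the hypothesis and return the posterior over output characters; this sketch also omits length normalization.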
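In the embodiments of this application, before step 101, a large amount of speech data with text labels may be collected and its speech features extracted; the speech features may be at least one of PLP, MFCC, FilterBank and so on, which is not limited here. The text-labeled speech data collected here may be used to train the hot word audio encoder module. The speech feature sequence and text label sequence of an utterance can be expressed as follows:
Speech feature sequence $X=[x_1,x_2,\ldots,x_k,\ldots,x_K]$
Text label sequence $Y=[y_0,y_1,\ldots,y_t,\ldots,y_T]$
Here $x_k$ is the speech feature vector of the $k$-th frame in the speech feature sequence $X$ and $K$ is the total number of speech frames; $y_t$ is the $t$-th character in the text label sequence $Y$ and $T+1$ is the total number of characters in the text label, where $y_0$ is the sentence start symbol "<s>" and $y_T$ is the sentence end symbol "</s>". Take Japanese speech recognition as an example, with single characters as modeling units. Suppose the text of an utterance is "今日は東京は風が強い", which has 10 characters. Adding the sentence start and end symbols gives a text label sequence of 12 characters: $Y$ = [<s>, 今, 日, は, 東, 京, は, 風, が, 強, い, </s>].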
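The label sequence in this example can be built mechanically; the helper name below is hypothetical.

```python
def build_label_sequence(text):
    """Character-level labels wrapped in sentence start and end symbols."""
    return ["<s>"] + list(text) + ["</s>"]


Y = build_label_sequence("今日は東京は風が強い")
T = len(Y) - 1  # labels are indexed y_0 .. y_T, so there are T + 1 of them
```

For the ten-character sentence this gives twelve labels, with `Y[0]` the start symbol and `Y[T]` the end symbol, matching the sequence written out above.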
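In a specific implementation, the speech recognition model should support recognition of arbitrary hot words, which means the hot words cannot be fixed during model training. The embodiments of this application may therefore randomly pick label fragments from the text labels of the training data as hot words to take part in training the whole model. The following describes one model update on B utterances of speech data, where B is an integer greater than 1. Two parameters P and N may be set: P is the probability that a hot word is picked from a given training utterance, and N is the maximum number of characters in a picked hot word. For example, with P=0.5 and N=4, any training utterance has a 50% probability of being selected, and at most 4 consecutive characters are picked from its text label as the hot word. Taking "今日は東京は風が強い" as an example, the labels before and after a hot word is picked from the sentence compare as shown in the following table (table image PCTCN2021073773-appb-000005):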
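When "東" and "京" in the original label are picked as the hot word, the special marker "<bias>" may be added after them. The role of "<bias>" is to introduce a training error that forces the model to update hot-word-related parameters during training, such as the parameters of the hot word audio encoder module or of the hot word text encoder module. After "東京" is picked as a hot word, it may be added to the hot word list of this model update and used as input to the hot word audio encoder module or the hot word text encoder module. Hot word picking is carried out independently for every model update, and the hot word list may be empty at the start. Once the data has been prepared, the model parameters can be updated with a neural network optimization method. In the training phase, sample data and the corresponding ground-truth recognition result are obtained. The sample data is encoded to obtain the first feature vector sequence; each hot word in the preset hot word library is encoded to obtain the second feature vector sequence; the audio segment of each hot word in the preset hot word library is encoded to obtain the third feature vector sequence; the first attention operation is performed on the first and third feature vector sequences to obtain the fourth feature vector sequence; a decoding operation is performed according to the second, third and fourth feature vector sequences to obtain a predicted recognition result; and the model parameters are updated according to the deviation between the ground-truth recognition result and the predicted recognition result.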
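The random hot word picking with parameters P and N can be sketched as below. This is one plausible reading of the description, not a confirmed implementation: the text does not say how the span length or position is drawn, so the uniform choices here are assumptions, and `pick_hot_word` is a hypothetical helper.

```python
import random


def pick_hot_word(chars, p=0.5, max_len=4, rng=random):
    """With probability p, pick up to max_len consecutive characters as a hot word
    and insert the <bias> marker right after them.

    Returns (labels, hot_word); hot_word is None when the utterance is not selected.
    """
    if not chars or rng.random() >= p:
        return list(chars), None
    length = rng.randint(1, min(max_len, len(chars)))  # assumed: uniform span length
    start = rng.randint(0, len(chars) - length)        # assumed: uniform start position
    end = start + length
    hot_word = "".join(chars[start:end])
    labels = list(chars[:end]) + ["<bias>"] + list(chars[end:])
    return labels, hot_word


# One model update: the hot word list starts empty and is filled from the batch.
rng = random.Random(0)
batch = [list("今日は東京は風が強い")]
hot_word_list = []
labeled = []
for chars in batch:
    labels, hot = pick_hot_word(chars, p=1.0, max_len=4, rng=rng)
    labeled.append(labels)
    if hot is not None:
        hot_word_list.append(hot)
```

With `p=1.0` the utterance is always selected, which makes the example deterministic in structure; in training `p` would be a value such as 0.5, as in the text.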
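In one possible example, step 105 above, performing a decoding operation according to the second, third and fourth feature vector sequences to obtain a recognition result, may comprise the following steps:
A51. Perform a second attention operation on the second, third and fourth feature vector sequences to obtain a hot word text context feature vector sequence, a hot word audio context feature vector sequence and an audio context feature vector sequence;
A52. Input the hot word text context feature vector sequence, the hot word audio context feature vector sequence and the audio context feature vector sequence to the decoder for decoding to obtain the recognition result.
The electronic device may perform the second attention operation on the second, third and fourth feature vector sequences to obtain the hot word text context feature vector sequence, the hot word audio context feature vector sequence and the audio context feature vector sequence. The role of the word-level attention module is, at each decoding step, to extract from the audio feature vector sequence, the hot word text feature vector sequence and the hot word audio feature vector sequence the audio-related and hot-word-related feature vectors needed at that step. Taking the $t$-th character as an example, when the model predicts the $t$-th character, the audio-related feature vector can be regarded as representing the audio content of the character to be decoded at step $t$, the hot word text related feature vector as representing the possible hot word text content at step $t$, and the hot word audio related feature vector as representing the possible hot word audio content at step $t$.
In the word-level attention mechanism, a vector is used as the query and attention is performed over a sequence of feature vectors to select the feature vector that best matches the query as the output. Specifically, a matching coefficient is computed between the query and each feature vector in the sequence, and the feature vectors are multiplied by their matching coefficients and summed to give a new feature vector, which is the feature vector that best matches the query.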
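In one possible example, step A51 above, performing the second attention operation on the second, third and fourth feature vector sequences to obtain the hot word text context feature vector sequence, the hot word audio context feature vector sequence and the audio context feature vector sequence, may comprise the following steps:
A511. Obtain the first state feature vector of the decoder at the current time;
A512. Perform an attention operation on the second feature vector sequence according to the first state feature vector to obtain the hot word text context feature vector sequence of the current time;
A513. Perform an attention operation on the third feature vector sequence according to the first state feature vector to obtain the hot word audio context feature vector sequence of the current time;
A514. Perform an attention operation on the fourth feature vector sequence according to the first state feature vector to obtain the audio context feature vector sequence of the current time.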
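In a specific implementation, suppose the first state feature vector of the decoder at the current time is $d_t$. Using $d_t$ as the query and the attention mechanism described above, attention is performed over the fourth feature vector sequence output by the frame-level attention module (formula image PCTCN2021073773-appb-000006) to obtain the audio context feature vector sequence $c_t^x$. Because the hot words take part in computing the fourth feature vector sequence, it contains the complete audio information of potential hot words, so $c_t^x$ computed in this way also carries information about whether a hot word is present and which hot word it is. Similarly, using $d_t$ as the query, attention over the second feature vector sequence $H^z$ output by the hot word text encoder module gives the hot word text context feature vector sequence $c_t^z$, and attention over the third feature vector sequence $H^w$ output by the hot word audio encoder module gives the hot word audio context feature vector sequence $c_t^w$.
Further, after $c_t^x$, $c_t^z$ and $c_t^w$ have been computed, the three vectors may be concatenated and fed to the decoder module for decoding at step $t$. Because $c_t^w$, which contains the audio information of the hot words in the preset hot word library, has been added, the subsequent decoding accuracy for hot words benefits.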
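The word-level attention step can be sketched as follows. This is a minimal sketch under stated assumptions: the matching coefficient is a softmax-normalized inner product, and every sequence has already been projected to the same dimension as the decoder state $d_t$ (the text does not specify how dimensions are matched). The names `attend` and `word_level_attention` are hypothetical.

```python
import math


def attend(query, sequence):
    """Weight each vector by its normalized match with the query and sum them."""
    scores = [sum(q * v for q, v in zip(query, vec)) for vec in sequence]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(sequence[0])
    return [sum(w * vec[d] for w, vec in zip(weights, sequence)) for d in range(dim)]


def word_level_attention(d_t, H_fused, H_z, H_w):
    """Use the decoder state as the query over the three sequences and
    concatenate the three context vectors for the decoder."""
    c_x = attend(d_t, H_fused)  # audio context
    c_z = attend(d_t, H_z)      # hot word text context
    c_w = attend(d_t, H_w)      # hot word audio context
    return c_x + c_z + c_w


# Hypothetical toy inputs, all of dimension 2.
d_t = [1.0, 0.0]
c_t = word_level_attention(
    d_t,
    H_fused=[[1.0, 1.0], [1.0, 1.0]],
    H_z=[[2.0, 0.0]],
    H_w=[[0.0, 3.0]],
)
```

The concatenated `c_t` is what the text feeds to the decoder module at step $t$.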
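Alternatively, in other implementations, the second attention operation may be performed on the second, third and fourth feature vector sequences based on the first feature vector sequence, to obtain the hot word text context feature vector sequence, the hot word audio context feature vector sequence and the audio context feature vector sequence.
Further, in one possible example, the decoder comprises a first unidirectional LSTM layer, and step A511 above, obtaining the first state feature vector of the decoder at the current time, may comprise the following steps:
A5111. Obtain the recognition result of a first historical time and the hot word text context feature vector sequence, the hot word audio context feature vector sequence and the audio context feature vector sequence of that first historical time;
A5112. Input the recognition result of the first historical time and the hot word text context feature vector sequence, the hot word audio context feature vector sequence and the audio context feature vector sequence of the first historical time to the first unidirectional LSTM layer to obtain the first state feature vector.
The first historical time is at least one time before the current time; that is, it may be the time immediately preceding the current time, or several times before the current time. The decoder may comprise two unidirectional LSTM layers, namely a first unidirectional LSTM layer and a second unidirectional LSTM layer. In a specific implementation, the electronic device may obtain the decoder's recognition result at the first historical time together with the hot word text context, hot word audio context and audio context feature vector sequences of that time and input them to the first unidirectional LSTM layer to obtain the first state feature vector. Fusing the recognition result of the first historical time with the corresponding input content in memory (features) helps improve the model's predictive ability.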
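In a specific implementation, the hot word text context, hot word audio context and audio context feature vector sequences of the first historical time may be obtained as follows: obtain the first state feature vector of the decoder at the first historical time; perform attention over the second feature vector sequence according to it to obtain the hot word text context feature vector sequence of the first historical time; perform attention over the third feature vector sequence according to it to obtain the hot word audio context feature vector sequence of the first historical time; and perform attention over the fourth feature vector sequence according to it to obtain the audio context feature vector sequence of the first historical time. For example, suppose the first state feature vector of the decoder at the first historical time is $d_{t-1}$. It may be used as the query to perform attention over the second, third and fourth feature vector sequences input at the first historical time. $d_{t-1}$ may in turn be the first state feature vector obtained by inputting the recognition result of a second historical time and the hot word text context, hot word audio context and audio context feature vector sequences of the second historical time to the first unidirectional LSTM layer. The second historical time may be at least one time before the first historical time; that is, it may be the time immediately preceding the first historical time, or several times before the first historical time.
Alternatively, in another implementation, all or some of the recognition results before the current time, together with the hot word text context, hot word audio context and audio context feature vector sequences input to the decoder at the first historical time, may be input to the first unidirectional LSTM layer to obtain the first state feature vector. The hot word text context, hot word audio context and audio context feature vector sequences input to the decoder at the current time are then input to the second unidirectional LSTM layer to obtain the recognition result of the current time.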
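Further, in one possible example, the decoder comprises a second unidirectional LSTM layer, and step A52 above, inputting the hot word text context feature vector sequence, the hot word audio context feature vector sequence and the audio context feature vector sequence to the decoder for decoding to obtain the recognition result, may comprise the following step:
Input the hot word text context feature vector sequence, the hot word audio context feature vector sequence and the audio context feature vector sequence that are input to the decoder at the current time to the second unidirectional LSTM layer to obtain the recognition result of the current time, where the hot word text context, hot word audio context and audio context feature vector sequences of the current time are obtained by performing the second attention operation on the second, third and fourth feature vector sequences of the current time with the first state feature vector.
In a specific implementation, the current time can be understood as the current decoding step. For example, if the first historical time is the time immediately preceding the current time, then when the $t$-th character is decoded, the first historical time is the step at which the $(t-1)$-th character was decoded. The decoder may comprise two unidirectional LSTM layers. Taking the $t$-th character (step) as an example, when decoding the $t$-th character, the first LSTM layer takes as input the recognized character $y_{t-1}$ of step $t-1$ and the output vector $c_{t-1}$ of the word-level attention module (the hot word text context, hot word audio context and audio context feature vector sequences input to the decoder for the $(t-1)$-th character), and computes the first state feature vector $d_t$ of the decoder. $d_t$ is input to the word-level attention module to compute its output $c_t$ at step $t$, which comprises the hot word text context, hot word audio context and audio context feature vector sequences of step $t$. $c_t$ is then the input of the second LSTM layer, which computes the decoder output $h_t^d$. Finally, the posterior probabilities of the output characters are computed to obtain the recognition result.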
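The decoding step just described can be summarized in equations. This is a notational sketch only: the recurrent states of the two LSTM layers are left implicit, $\tilde{H}^x$ is an assumed symbol for the fourth feature vector sequence (its original symbol is in a formula image), and the output projection $W_o$, $b_o$ is an assumption, since the text says only that the posterior probability of the output characters is computed.

```latex
\begin{aligned}
d_t &= \mathrm{LSTM}_1\bigl(y_{t-1},\, c_{t-1}\bigr) \\
c_t &= \bigl[\,c_t^{x};\; c_t^{z};\; c_t^{w}\,\bigr]
     = \bigl[\,\mathrm{Att}(d_t, \tilde{H}^{x});\; \mathrm{Att}(d_t, H^{z});\; \mathrm{Att}(d_t, H^{w})\,\bigr] \\
h_t^{d} &= \mathrm{LSTM}_2\bigl(c_t\bigr) \\
P\bigl(y_t \mid y_{<t}, X\bigr) &= \mathrm{softmax}\bigl(W_o\, h_t^{d} + b_o\bigr)
\end{aligned}
```

Here $\mathrm{Att}(q, H)$ denotes the word-level attention with query $q$ over the sequence $H$, and $[\,\cdot\,;\,\cdot\,]$ denotes concatenation.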
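Based on the above embodiments of this application: first, in addition to the hot word text information, hot word speech segments are added as an extra boosting input. Combining hot word text with speech segments enriches the hot word input information and is expected to improve hot word boosting considerably. Second, two-level boosting, i.e. two attention operations, is expected to improve hot word boosting further. The dual-input and two-level boosting schemes complement each other and jointly improve hot word recognition, which in turn helps improve hot word recognition accuracy.
In one possible example, step 105 above, performing a decoding operation according to the second, third and fourth feature vector sequences to obtain a recognition result, may be implemented as follows:
Input the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence to the decoder for decoding to obtain the recognition result.
In a specific implementation, the electronic device may directly input the second, third and fourth feature vector sequences to the decoder for decoding to obtain the recognition result. The inputs comprise not only the hot word text information but also the corresponding audio segments, together with a representation that fuses the speech data to be recognized with the hot word audio segments through attention. This significantly strengthens hot word boosting, and decoding the three together improves hot word recognition and hence hot word recognition accuracy.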
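Further, in one possible example, the decoder comprises two unidirectional LSTM layers, namely a first unidirectional LSTM layer and a second unidirectional LSTM layer, and the above step of inputting the second, third and fourth feature vector sequences to the decoder for decoding to obtain the recognition result may comprise the following steps:
B51. Obtain the recognition result of a first historical time and the second, third and fourth feature vector sequences input to the decoder at that first historical time;
B52. Input the recognition result of the first historical time and the second, third and fourth feature vector sequences input to the decoder at the first historical time to the first unidirectional LSTM layer to obtain a second state feature vector;
B53. Input the second, third and fourth feature vector sequences that are input to the decoder at the current time to the second unidirectional LSTM layer to obtain the recognition result of the current time, where the fourth feature vector sequence is obtained by performing the first attention operation on the first feature vector sequence and the third feature vector sequence of the current time with the second state feature vector.
The decoder may comprise multiple neural network layers. For example, the decoder may comprise two unidirectional LSTM layers, namely a first unidirectional LSTM layer and a second unidirectional LSTM layer.
In a specific implementation, the electronic device may obtain the recognition result of the first historical time and the second, third and fourth feature vector sequences input to the decoder at the first historical time, and input them to the first unidirectional LSTM layer to obtain the second state feature vector. It then inputs the second, third and fourth feature vector sequences that are input to the decoder at the current time to the second unidirectional LSTM layer to obtain the recognition result of the current time. The fourth feature vector sequence may be obtained by performing the first attention operation, with the second state feature vector, on at least one of the first feature vector sequence and the third feature vector sequence of the current time; for example, the first attention operation may be performed with the second state feature vector on the first feature vector sequence and on the third feature vector sequence of the current time respectively. The output of the second unidirectional LSTM layer of the decoder is thus obtained, and posterior probabilities may be computed on it to give the final decoding result, i.e. the recognition result of the current time.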
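It can be seen that, in the speech recognition method described in the embodiments of this application, speech data to be recognized is encoded to obtain a first feature vector sequence; each hot word in a preset hot word library is encoded to obtain a second feature vector sequence; the audio segment of each hot word in the preset hot word library is encoded to obtain a third feature vector sequence; a first attention operation is performed on the first and third feature vector sequences to obtain a fourth feature vector sequence; and a decoding operation is performed according to the second, third and fourth feature vector sequences to obtain a recognition result. The inputs thus comprise not only the hot word text information but also the corresponding audio segments, together with a representation that fuses the speech data to be recognized with the hot word audio segments through attention. This significantly strengthens hot word boosting, and decoding the three together improves hot word recognition and hence hot word recognition accuracy.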
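Consistent with the embodiment shown in Fig. 1B above, refer to Fig. 2, which is a schematic flowchart of a speech recognition method provided by an embodiment of this application. As shown, the method of Fig. 2 is applied to the speech recognition model of Fig. 1A, and the model is applied to an electronic device. The method comprises:
201. Encode the speech data to be recognized to obtain a first feature vector sequence.
202. Encode each hot word in the preset hot word library to obtain a second feature vector sequence.
203. Encode the audio segment of each hot word in the preset hot word library to obtain a third feature vector sequence.
204. Perform a first attention operation on the first feature vector sequence and the third feature vector sequence to obtain a fourth feature vector sequence.
205. Perform a second attention operation on the second feature vector sequence, the third feature vector sequence and the fourth feature vector sequence to obtain a hot word text context feature vector sequence, a hot word audio context feature vector sequence and an audio context feature vector sequence.
206. Input the hot word text context feature vector sequence, the hot word audio context feature vector sequence and the audio context feature vector sequence to the decoder for decoding to obtain a recognition result.
For a detailed description of steps 201 to 206, refer to the corresponding steps of the speech recognition method described with reference to Fig. 1B; they are not repeated here.
It can be seen that, in the speech recognition method described in the embodiments of this application: first, in addition to the hot word text information, hot word speech segments are added as an extra boosting input. Combining hot word text with speech segments enriches the hot word input information and is expected to improve hot word boosting considerably. Second, two-level boosting, i.e. two attention operations, is expected to improve hot word boosting further. The dual-input and two-level boosting schemes complement each other and jointly improve hot word recognition, which in turn helps improve hot word recognition accuracy.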
与上述实施例一致地,请参阅图3,图3是本申请实施例提供的一种电子设备的结构示意图,如图 所示,该电子设备包括处理器、存储器、通信接口以及一个或多个程序,其中,上述一个或多个程序被存储在上述存储器中,并且被配置由上述处理器执行,本申请实施例中,上述程序包括用于执行以下步骤的指令:
对待识别语音数据进行编码,得到第一特征向量序列;
对预设热词库中每一热词进行编码,得到第二特征向量序列;
对所述预设热词库中每个热词的音频片段进行编码,得到第三特征向量序列;
将所述第一特征向量序列和所述第三特征向量序列进行第一注意力操作,得到第四特征向量序列;
根据所述第二特征向量序列、所述第三特征向量序列和所述第四特征向量序列进行解码操作,得到识别结果。
可以看出,本申请实施例中所描述的电子设备,对待识别语音数据进行编码,得到第一特征向量序列;对预设热词库中每一热词进行编码,得到第二特征向量序列,对预设热词库中每个热词的音频片段进行编码,得到第三特征向量序列,将第一特征向量序列和第三特征向量序列进行第一注意力操作,得到第四特征向量序列,根据第二特征向量序列、第三特征向量序列和第四特征向量序列进行解码操作,得到识别结果,由于不仅将热词文本信息作为输入,还将其对应的音频片段作为输入,以及将待识别语音数据以及热词文本信息的音频片段进行注意力操作加以融合后作为输入,进而,能够显著提升热词激励效果,再将三者进行解码操作,能够提升热词识别效果,从而,提升热词识别精度。
在一个可能地示例中,在所述将所述第一特征向量序列和所述第三特征向量序列进行第一注意力操作,得到第四特征向量序列方面,上述程序包括用于执行以下步骤的指令:
将所述第一特征向量序列中的各个第一特征向量与所述第三特征向量序列中的每一第三特征向量进行匹配运算,得到各第三特征向量对应的匹配系数;
将所述各第三特征向量对应的匹配系数与对应的第三特征向量进行运算,得到所述各第三特征向量对应的新特征向量;
将所述各第三特征向量对应的新特征向量与对应的所述第一特征向量进行拼接,得到所述各第三特征向量对应的表征向量,将所述各第三特征向量对应的表征向作为所述第四特征向量序列。
在一个可能地示例中,在所述根据所述第二特征向量序列、所述第三特征向量序列和所述第四特征向量序列进行解码操作,得到识别结果方面,上述程序包括用于执行以下步骤的指令:
将所述第二特征向量序列、所述第三特征向量序列和所述第四特征向量序列进行第二注意力操作,得到热词文本上下文特征向量序列、热词音频上下文特征向量序列和音频上下文特征向量序列;
将所述热词文本上下文特征向量序列、所述热词音频上下文特征向量序列和所述音频上下文特征向量序列输入到解码器进行解码操作,得到识别结果。
在一个可能地示例中,在所述将所述第二特征向量序列、所述第三特征向量序列和所述第四特征向量序列进行第二注意力操作,得到热词文本上下文特征向量序列、热词音频上下文特征向量序列和音频上下文特征向量序列方面,上述程序包括用于执行以下步骤的指令:
获取当前时刻所述解码器的第一状态特征向量;
依据所述第一状态特征向量对所述第二特征向量序列进行注意力操作,得到所述当前时刻的所述热词文本上下文特征向量序列;
依据所述第一状态特征向量对所述第三特征向量序列进行注意力操作,得到所述当前时刻的所述热词音频上下文特征向量序列;
依据所述第一状态特征向量对所述第四特征向量序列进行注意力操作,得到所述当前时刻的所述音频上下文特征向量序列。
在一个可能地示例中,所述解码器包括第一层单向长短时记忆层,在所述获取第一历史时刻所述解码器的第一状态特征向量方面,上述程序包括用于执行以下步骤的指令:
获取第一历史时刻的识别结果以及该第一历史时刻的所述热词文本上下文特征向量序列、所述热词音频上下文特征向量序列和所述音频上下文特征向量序列;
依据所述第一历史时刻的识别结果以及所述第一历史时刻的所述热词文本上下文特征向量序列、所述热词音频上下文特征向量序列和所述音频上下文特征向量序列输入到所述第一层单向长短时记忆层,得到所述第一状态特征向量。
在一个可能地示例中,所述解码器包括第二层单向长短时记忆层,在所述将所述热词文本上下文特征向量序列、所述热词音频上下文特征向量序列和所述音频上下文特征向量序列输入到解码器进行解码操作,得到识别结果方面,上述程序包括用于执行以下步骤的指令:
将当前时刻输入所述解码器的所述热词文本上下文特征向量序列、所述热词音频上下文特征向量序列和所述音频上下文特征向量序列输入到所述第二层单向长短时记忆层,得到所述当前时刻的识别结果,所述当前时刻的所述热词文本上下文特征向量序列、所述热词音频上下文特征向量序列和所述音频上下文特征向量序列为通过第一状态特征向量分别对所述当前时刻的第二特征向量序列、所述第三特征向量序列以及所述第四特征向量序列进行所述第二注意力操作而得到。
在一个可能地示例中,在所述根据所述第二特征向量序列、所述第三特征向量序列和所述第四特征向量序列进行解码操作,得到识别结果方面,上述程序包括用于执行以下步骤的指令:
将所述第二特征向量序列、所述第三特征向量序列和所述第四特征向量序列输入到解码器进行解码操作,得到所述识别结果。
在一个可能地示例中,所述解码器包括两层单向长短时记忆层,所述两层单向长短时记忆层包括第一层单向长短时记忆层和第二层单向长短时记忆层,在所述将所述第二特征向量序列、所述第三特征向量序列和所述第四特征向量序列输入到解码器进行解码操作,得到所述识别结果方面,上述程序包括用于执行以下步骤的指令:
获取第一历史时刻的识别结果以及该第一历史时刻的输入所述解码器的所述第二特征向量序列、所述第三特征向量序列和所述第四特征向量序列;
依据所述第一历史时刻的识别结果以及所述第一历史时刻的输入所述解码器的所述第二特征向量序列、所述第三特征向量序列和所述第四特征向量序列输入到所述第一层单向长短时记忆层,得到第二状态特征向量;
将当前时刻输入所述解码器的所述第二特征向量序列、所述第三特征向量序列和所述第四特征向量序列输入到所述第二层单向长短时记忆层,得到所述当前时刻的识别结果,所述第四特征向量序列为通过所述第二状态特征向量对所述当前时刻的所述第一特征向量序列和所述第三特征向量序列进行所述第一注意力操作而得到。
在一个可能地示例中,所述对所述预设热词库中每个热词的音频片段进行编码,得到第三特征向量序列,包括:
通过一层或者多层编码层对所述预设热词库中每一热词的音频片段进行编码,得到所述第三特征向量序列,所述编码层包括:长短时记忆神经网络中长短时记忆层或者卷积神经网络的卷积层,所述长短时记忆神经网络中长短时记忆层为基于单向或者双向的长短时记忆神经网络中长短时记忆层。
上述主要从方法侧执行过程的角度对本申请实施例的方案进行了介绍。可以理解的是,电子设备为了实现上述功能,其包含了执行各个功能相应的硬件结构和/或软件模块。本领域技术人员应该很容易意识到,结合本文中所提供的实施例描述的各示例的单元及算法步骤,本申请能够以硬件或硬件和计算机软件的结合形式来实现。某个功能究竟以硬件还是计算机软件驱动硬件的方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。
本申请实施例可以根据上述方法示例对电子设备进行功能单元的划分,例如,可以对应各个功能划分各个功能单元,也可以将两个或两个以上的功能集成在一个处理单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。需要说明的是,本申请实施例中对单元的划分是示意性的,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式。
图4A是本申请实施例中所涉及的语音识别装置400的功能单元组成框图。该语音识别装置400,应用于电子设备,所述语音识别装置400包括:音频编码器模块401、热词文本编码器模块402、热词音频编码器模块403、帧层级注意力模块404和解码器模块405,其中,
所述音频编码器模块401,用于对待识别语音数据进行编码,得到第一特征向量序列;
所述热词文本编码器模块402,用于对预设热词库中每一热词进行编码,得到第二特征向量序列;
所述热词音频编码器模块403,用于对所述预设热词库中每个热词的音频片段进行编码,得到第三特征向量序列;
所述帧层级注意力模块404,用于将所述第一特征向量序列和所述第三特征向量序列进行第一注意力操作,得到第四特征向量序列;
所述解码器模块405,用于根据所述第二特征向量序列、所述第三特征向量序列和所述第四特征向量序列进行解码操作,得到识别结果。
可以看出,本申请实施例中所描述的语音识别装置,对待识别语音数据进行编码,得到第一特征向量序列;对预设热词库中每一热词进行编码,得到第二特征向量序列,对预设热词库中每个热词的音频片段进行编码,得到第三特征向量序列,将第一特征向量序列和第三特征向量序列进行第一注意力操作,得到第四特征向量序列,根据第二特征向量序列、第三特征向量序列和第四特征向量序列进行解码操作,得到识别结果,由于不仅将热词文本信息作为输入,还将其对应的音频片段作为输入,以及将待识别语音数据以及热词文本信息的音频片段进行注意力操作加以融合后作为输入,进而,能够显著提升热词激励效果,再将三者进行解码操作,能够提升热词识别效果,从而,提升热词识别精度。
在一个可能地示例中,在所述将所述第一特征向量序列和所述第三特征向量序列进行第一注意力操作,得到第四特征向量序列方面,所述帧层级注意力模块404具体用于:
将所述第一特征向量序列中的各个第一特征向量与所述第三特征向量序列中的每一第三特征向量进行匹配运算,得到各第三特征向量对应的匹配系数;
将所述各第三特征向量对应的匹配系数与对应的第三特征向量进行运算,得到所述各第三特征向量对应的新特征向量;
将所述各第三特征向量对应的新特征向量与对应的所述第一特征向量进行拼接,得到所述各第三特征向量对应的表征向量,将所述各第三特征向量对应的表征向作为所述第四特征向量序列。
在一个可能地示例中,如图4B所示,图4B为图4A所示的语音识别装置的又一变型结构,其与图4A相比较,其解码器模块405可以包括:词层级注意力模块4051和解码器4052,其中,
所述词层级注意力模块4051,用于将所述第二特征向量序列、所述第三特征向量序列和所述第四特征向量序列进行第二注意力操作,得到热词文本上下文特征向量序列、热词音频上下文特征向量序列和音频上下文特征向量序列;
所述解码器4052,用于将所述热词文本上下文特征向量序列、所述热词音频上下文特征向量序列和所述音频上下文特征向量序列输入到解码器进行解码操作,得到识别结果。
在一个可能地示例中,在所述将所述第二特征向量序列、所述第三特征向量序列和所述第四特征向量序列进行第二注意力操作,得到热词文本上下文特征向量序列、热词音频上下文特征向量序列和音频上下文特征向量序列方面,所述词层级注意力模块4051具体用于:
获取当前时刻所述解码器的第一状态特征向量;
依据所述第一状态特征向量对所述第二特征向量序列进行注意力操作,得到所述当前时刻的所述热词文本上下文特征向量序列;
依据所述第一状态特征向量对所述第三特征向量序列进行注意力操作,得到所述当前时刻的所述热词音频上下文特征向量序列;
依据所述第一状态特征向量对所述第四特征向量序列进行注意力操作,得到所述当前时刻的所述音频上下文特征向量序列。
在一个可能地示例中,所述解码器包括第一层单向长短时记忆层,在所述获取第一历史时刻所述解 码器的第一状态特征向量方面,所述词层级注意力模块4051具体用于:
获取第一历史时刻的识别结果以及该第一历史时刻的所述热词文本上下文特征向量序列、所述热词音频上下文特征向量序列和所述音频上下文特征向量序列;
依据所述第一历史时刻的识别结果以及所述第一历史时刻的所述热词文本上下文特征向量序列、所述热词音频上下文特征向量序列和所述音频上下文特征向量序列输入到所述第一层单向长短时记忆层,得到所述第一状态特征向量。
在一个可能地示例中,所述解码器包括第二层单向长短时记忆层,在所述将所述热词文本上下文特征向量序列、所述热词音频上下文特征向量序列和所述音频上下文特征向量序列输入到解码器进行解码操作,得到识别结果方面,所述解码器4052具体用于:
将当前时刻输入所述解码器的所述热词文本上下文特征向量序列、所述热词音频上下文特征向量序列和所述音频上下文特征向量序列输入到所述第二层单向长短时记忆层,得到所述当前时刻的识别结果,所述当前时刻的所述热词文本上下文特征向量序列、所述热词音频上下文特征向量序列和所述音频上下文特征向量序列为通过第一状态特征向量分别对所述当前时刻的第二特征向量序列、所述第三特征向量序列以及所述第四特征向量序列进行所述第二注意力操作而得到。
在一个可能地示例中,在所述根据所述第二特征向量序列、所述第三特征向量序列和所述第四特征向量序列进行解码操作,得到识别结果方面,所述解码器模块405具体用于:
将所述第二特征向量序列、所述第三特征向量序列和所述第四特征向量序列输入到解码器进行解码操作,得到所述识别结果。
在一个可能地示例中,所述解码器包括两层单向长短时记忆层,所述两层单向长短时记忆层包括第一层单向长短时记忆层和第二层单向长短时记忆层,在所述将所述第二特征向量序列、所述第三特征向量序列和所述第四特征向量序列输入到解码器进行解码操作,得到所述识别结果方面,所述解码器模块405具体用于:
获取第一历史时刻的识别结果以及该第一历史时刻的输入所述解码器的所述第二特征向量序列、所述第三特征向量序列和所述第四特征向量序列;
依据所述第一历史时刻的识别结果以及所述第一历史时刻的输入所述解码器的所述第二特征向量序列、所述第三特征向量序列和所述第四特征向量序列输入到所述第一层单向长短时记忆层,得到第二状态特征向量;
将当前时刻输入所述解码器的所述第二特征向量序列、所述第三特征向量序列和所述第四特征向量序列输入到所述第二层单向长短时记忆层,得到所述当前时刻的识别结果,所述第四特征向量序列为通过所述第二状态特征向量对所述当前时刻的所述第一特征向量序列和所述第三特征向量序列进行所述第一注意力操作而得到。
在一个可能地示例中,在所述对所述预设热词库中每个热词的音频片段进行编码,得到第三特征向量序列方面,所述热词音频编码器模块403具体用于:
通过一层或者多层编码层对所述预设热词库中每一热词的音频片段进行编码,得到所述第三特征向量序列,所述编码层包括:长短时记忆神经网络中长短时记忆层或者卷积神经网络的卷积层,所述长短时记忆神经网络中长短时记忆层为基于单向或者双向的长短时记忆神经网络中长短时记忆层。
可以理解的是,本实施例的语音识别装置的各程序模块的功能可根据上述方法实施例中的方法具体实现,其具体实现过程可以参照上述方法实施例的相关描述,此处不再赘述。
本申请实施例还提供一种计算机存储介质,其中,该计算机存储介质存储用于电子数据交换的计算机程序,该计算机程序使得计算机执行如上述方法实施例中记载的任一方法的部分或全部步骤,上述计算机包括电子设备。
本申请实施例还提供一种计算机程序产品,上述计算机程序产品包括存储了计算机程序的非瞬时性计算机可读存储介质,上述计算机程序可操作来使计算机执行如上述方法实施例中记载的任一方法的部 分或全部步骤。该计算机程序产品可以为一个软件安装包,上述计算机包括电子设备。
需要说明的是,对于前述的各方法实施例,为了简单描述,故将其都表述为一系列的动作组合,但是本领域技术人员应该知悉,本申请并不受所描述的动作顺序的限制,因为依据本申请,某些步骤可以采用其他顺序或者同时进行。其次,本领域技术人员也应该知悉,说明书中所描述的实施例均属于优选实施例,所涉及的动作和模块并不一定是本申请所必须的。
在上述实施例中,对各个实施例的描述都各有侧重,某个实施例中没有详述的部分,可以参见其他实施例的相关描述。
在本申请所提供的几个实施例中,应该理解到,所揭露的装置,可通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如上述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性或其它的形式。
上述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。
上述集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储器中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储器中,包括若干指令用以使得一台计算机设备(可为个人计算机、服务器或者网络设备等)执行本申请各个实施例上述方法的全部或部分步骤。而前述的存储器包括:U盘、只读存储器(ROM,Read-Only Memory)、随机存取存储器(RAM,Random Access Memory)、移动硬盘、磁碟或者光盘等各种可以存储程序代码的介质。
本领域普通技术人员可以理解上述实施例的各种方法中的全部或部分步骤是可以通过程序来指令相关的硬件来完成,该程序可以存储于一计算机可读存储器中,存储器可以包括:闪存盘、只读存储器(英文:Read-Only Memory,简称:ROM)、随机存取器(英文:Random Access Memory,简称:RAM)、磁盘或光盘等。
以上对本申请实施例进行了详细介绍,本文中应用了具体个例对本申请的原理及实施方式进行了阐述,以上实施例的说明只是用于帮助理解本申请的方法及其核心思想;同时,对于本领域的一般技术人员,依据本申请的思想,在具体实施方式及应用范围上均会有改变之处,综上所述,本说明书内容不应理解为对本申请的限制。

Claims (21)

  1. 一种语音识别方法,其特征在于,所述方法包括:
    对待识别语音数据进行编码,得到第一特征向量序列;
    对预设热词库中每一热词进行编码,得到第二特征向量序列;
    对所述预设热词库中每个热词的音频片段进行编码,得到第三特征向量序列;
    将所述第一特征向量序列和所述第三特征向量序列进行第一注意力操作,得到第四特征向量序列;
    根据所述第二特征向量序列、所述第三特征向量序列和所述第四特征向量序列进行解码操作,得到识别结果。
  2. 根据权利要求1所述的方法,其特征在于,所述将所述第一特征向量序列和所述第三特征向量序列进行第一注意力操作,得到第四特征向量序列,包括:
    将所述第一特征向量序列中的各个第一特征向量与所述第三特征向量序列中的每一第三特征向量进行匹配运算,得到各第三特征向量对应的匹配系数;
    将所述各第三特征向量对应的匹配系数与对应的第三特征向量进行运算,得到所述各第三特征向量对应的新特征向量;
    将所述各第三特征向量对应的新特征向量与对应的所述第一特征向量进行拼接,得到所述各第三特征向量对应的表征向量,将所述各第三特征向量对应的表征向作为所述第四特征向量序列。
  3. 根据权利要求1或2所述的方法,其特征在于,所述根据所述第二特征向量序列、所述第三特征向量序列和所述第四特征向量序列进行解码操作,得到识别结果,包括:
    将所述第二特征向量序列、所述第三特征向量序列和所述第四特征向量序列进行第二注意力操作,得到热词文本上下文特征向量序列、热词音频上下文特征向量序列和音频上下文特征向量序列;
    将所述热词文本上下文特征向量序列、所述热词音频上下文特征向量序列和所述音频上下文特征向量序列输入到解码器进行解码操作,得到识别结果。
  4. 根据权利要求3所述的方法,其特征在于,所述将所述第二特征向量序列、所述第三特征向量序列和所述第四特征向量序列进行第二注意力操作,得到热词文本上下文特征向量序列、热词音频上下文特征向量序列和音频上下文特征向量序列,包括:
    获取当前时刻所述解码器的第一状态特征向量;
    依据所述第一状态特征向量对所述第二特征向量序列进行注意力操作,得到所述当前时刻的所述热词文本上下文特征向量序列;
    依据所述第一状态特征向量对所述第三特征向量序列进行注意力操作,得到所述当前时刻的所述热词音频上下文特征向量序列;
    依据所述第一状态特征向量对所述第四特征向量序列进行注意力操作,得到所述当前时刻的所述音频上下文特征向量序列。
  5. 根据权利要求4所述的方法,其特征在于,所述解码器包括第一层单向长短时记忆层,所述获取第一历史时刻所述解码器的第一状态特征向量,包括:
    获取第一历史时刻的识别结果以及该第一历史时刻的所述热词文本上下文特征向量序列、所述热词音频上下文特征向量序列和所述音频上下文特征向量序列;
    依据所述第一历史时刻的识别结果以及所述第一历史时刻的所述热词文本上下文特征向量序列、所述热词音频上下文特征向量序列和所述音频上下文特征向量序列输入到所述第一层单向长短时记忆层,得到所述第一状态特征向量。
  6. 根据权利要求3所述的方法,其特征在于,所述解码器包括第二层单向长短时记忆层,所述将所述热词文本上下文特征向量序列、所述热词音频上下文特征向量序列和所述音频上下文特征向量序列输入到解码器进行解码操作,得到识别结果,包括:
    将当前时刻输入所述解码器的所述热词文本上下文特征向量序列、所述热词音频上下文特征向量序 列和所述音频上下文特征向量序列输入到所述第二层单向长短时记忆层,得到所述当前时刻的识别结果,所述当前时刻的所述热词文本上下文特征向量序列、所述热词音频上下文特征向量序列和所述音频上下文特征向量序列为通过第一状态特征向量分别对所述当前时刻的第二特征向量序列、所述第三特征向量序列以及所述第四特征向量序列进行所述第二注意力操作而得到。
  7. 根据权利要求1所述的方法,其特征在于,所述根据所述第二特征向量序列、所述第三特征向量序列和所述第四特征向量序列进行解码操作,得到识别结果,包括:
    将所述第二特征向量序列、所述第三特征向量序列和所述第四特征向量序列输入到解码器进行解码操作,得到所述识别结果。
  8. 根据权利要求7所述的方法,其特征在于,所述解码器包括两层单向长短时记忆层,所述两层单向长短时记忆层包括第一层单向长短时记忆层和第二层单向长短时记忆层,所述将所述第二特征向量序列、所述第三特征向量序列和所述第四特征向量序列输入到解码器进行解码操作,得到所述识别结果,包括:
    获取第一历史时刻的识别结果以及该第一历史时刻的输入所述解码器的所述第二特征向量序列、所述第三特征向量序列和所述第四特征向量序列;
    依据所述第一历史时刻的识别结果以及所述第一历史时刻的输入所述解码器的所述第二特征向量序列、所述第三特征向量序列和所述第四特征向量序列输入到所述第一层单向长短时记忆层,得到第二状态特征向量;
    将当前时刻输入所述解码器的所述第二特征向量序列、所述第三特征向量序列和所述第四特征向量序列输入到所述第二层单向长短时记忆层,得到所述当前时刻的识别结果,所述第四特征向量序列为通过所述第二状态特征向量对所述当前时刻的所述第一特征向量序列和所述第三特征向量序列进行所述第一注意力操作而得到。
  9. 根据权利要求1-8任一项所述的方法,其特征在于,所述对所述预设热词库中每个热词的音频片段进行编码,得到第三特征向量序列,包括:
    通过一层或者多层编码层对所述预设热词库中每一热词的音频片段进行编码,得到所述第三特征向量序列,所述编码层包括:长短时记忆神经网络中长短时记忆层或者卷积神经网络的卷积层,所述长短时记忆神经网络中长短时记忆层为基于单向或者双向的长短时记忆神经网络中长短时记忆层。
  10. 一种语音识别装置,其特征在于,所述语音识别装置包括:音频编码器模块、热词文本编码器模块、热词音频编码器模块、帧层级注意力模块和解码器模块,其中,
    所述音频编码器模块,用于对待识别语音数据进行编码,得到第一特征向量序列;
    所述热词文本编码器模块,用于对预设热词库中每一热词进行编码,得到第二特征向量序列;
    所述热词音频编码器模块,用于对所述预设热词库中每个热词的音频片段进行编码,得到第三特征向量序列;
    所述帧层级注意力模块,用于将所述第一特征向量序列和所述第三特征向量序列进行第一注意力操作,得到第四特征向量序列;
    所述解码器模块,用于根据所述第二特征向量序列、所述第三特征向量序列和所述第四特征向量序列进行解码操作,得到识别结果。
  11. 根据权利要求10所述的装置,其特征在于,在所述将所述第一特征向量序列和所述第三特征向量序列进行第一注意力操作,得到第四特征向量序列方面,所述帧层级注意力模块具体用于:
    将所述第一特征向量序列中的各个第一特征向量与所述第三特征向量序列中的每一第三特征向量进行匹配运算,得到各第三特征向量对应的匹配系数;
    将所述各第三特征向量对应的匹配系数与对应的第三特征向量进行运算,得到所述各第三特征向量对应的新特征向量;
    将所述各第三特征向量对应的新特征向量与对应的所述第一特征向量进行拼接,得到所述各第三特 征向量对应的表征向量,将所述各第三特征向量对应的表征向作为所述第四特征向量序列。
  12. 根据权利要求10或11所述的装置,其特征在于,在所述根据所述第二特征向量序列、所述第三特征向量序列和所述第四特征向量序列进行解码操作,得到识别结果方面,所述编码器模块具体用于:
    将所述第二特征向量序列、所述第三特征向量序列和所述第四特征向量序列进行第二注意力操作,得到热词文本上下文特征向量序列、热词音频上下文特征向量序列和音频上下文特征向量序列;
    将所述热词文本上下文特征向量序列、所述热词音频上下文特征向量序列和所述音频上下文特征向量序列输入到解码器进行解码操作,得到识别结果。
  13. 根据权利要求12所述的装置,其特征在于,在所述将所述第二特征向量序列、所述第三特征向量序列和所述第四特征向量序列进行第二注意力操作,得到热词文本上下文特征向量序列、热词音频上下文特征向量序列和音频上下文特征向量序列方面,所述编码器模块具体用于:
    获取当前时刻所述解码器的第一状态特征向量;
    依据所述第一状态特征向量对所述第二特征向量序列进行注意力操作,得到所述热词文本上下文特征向量序列;
    依据所述第一状态特征向量对所述第三特征向量序列进行注意力操作,得到所述热词音频上下文特征向量序列;
    依据所述第一状态特征向量对所述第四特征向量序列进行注意力操作,得到所述音频上下文特征向量序列。
  14. 根据权利要求13所述的装置,其特征在于,所述解码器包括第一层单向长短时记忆层,在所述获取第一历史时刻所述解码器的第一状态特征向量方面,所述编码器模块具体用于:
    获取第一历史时刻的识别结果以及该第一历史时刻的所述热词文本上下文特征向量序列、所述热词音频上下文特征向量序列和所述音频上下文特征向量序列;
    依据所述第一历史时刻的识别结果以及所述第一历史时刻的所述热词文本上下文特征向量序列、所述热词音频上下文特征向量序列和所述音频上下文特征向量序列输入到所述第一层单向长短时记忆层,得到所述第一状态特征向量。
  15. 根据权利要求10所述的装置,其特征在于,所述解码器包括第二层单向长短时记忆层,在所述将所述热词文本上下文特征向量序列、所述热词音频上下文特征向量序列和所述音频上下文特征向量序列输入到解码器进行解码操作,得到识别结果方面,所述编码器模块具体用于:
    将当前时刻输入所述解码器的所述热词文本上下文特征向量序列、所述热词音频上下文特征向量序列和所述音频上下文特征向量序列输入到所述第二层单向长短时记忆层,得到所述当前时刻的识别结果,所述当前时刻的所述热词文本上下文特征向量序列、所述热词音频上下文特征向量序列和所述音频上下文特征向量序列为通过所述第一状态特征向量分别对所述当前时刻的第二特征向量序列、所述第三特征向量序列以及所述第四特征向量序列进行所述第二注意力操作而得到。
  16. 根据权利要求10所述的装置,其特征在于,在所述根据所述第二特征向量序列、所述第三特征向量序列和所述第四特征向量序列进行解码操作,得到识别结果方面,所述编码器模块具体用于:
    将所述第二特征向量序列、所述第三特征向量序列和所述第四特征向量序列输入到解码器进行解码操作,得到所述识别结果。
  17. 根据权利要求16所述的装置,其特征在于,所述解码器包括两层单向长短时记忆层,所述两层单向长短时记忆层包括第一层单向长短时记忆层和第二层单向长短时记忆层,在所述将所述第二特征向量序列、所述第三特征向量序列和所述第四特征向量序列输入到解码器进行解码操作,得到所述识别结果方面,所述编码器模块具体用于:
    获取第一历史时刻的识别结果以及该第一历史时刻的输入所述解码器的所述第二特征向量序列、所述第三特征向量序列和所述第四特征向量序列;
    依据所述第一历史时刻的识别结果以及所述第一历史时刻的输入所述解码器的所述第二特征向量序列、所述第三特征向量序列和所述第四特征向量序列输入到所述第一层单向长短时记忆层,得到第二 状态特征向量;
    将当前时刻输入所述解码器的所述第二特征向量序列、所述第三特征向量序列和所述第四特征向量序列输入到所述第二层单向长短时记忆层,得到所述当前时刻的识别结果,所述第四特征向量序列为通过所述第二状态特征向量对所述当前时刻的所述第一特征向量序列和所述第三特征向量序列进行所述第一注意力操作而得到。
  18. 根据权利要求10-17任一项所述的装置,其特征在于,在所述对所述预设热词库中每个热词的音频片段进行编码,得到第三特征向量序列方面,所述热词音频编码器模块具体用于:
    通过一层或者多层编码层对所述预设热词库中每一热词的音频片段进行编码,得到所述第三特征向量序列,所述编码层包括:长短时记忆神经网络中长短时记忆层或者卷积神经网络的卷积层,所述长短时记忆神经网络中长短时记忆层为基于单向或者双向的长短时记忆神经网络中长短时记忆层。
  19. 一种电子设备,其特征在于,包括:处理器,存储器,通信接口,以及一个或多个程序;所述一个或多个程序被存储在所述存储器中,并且被配置成由所述处理器执行,以执行权利要求1-9任一项方法中的步骤的指令。
  20. 一种计算机可读存储介质,其特征在于,存储用于电子数据交换的计算机程序,其中,所述计算机程序使得计算机执行如权利要求1-9任一项所述的方法。
  21. 一种计算机程序产品,其特征在于,所述计算机程序产品包括存储了计算机程序的非瞬时性计算机可读存储介质,所述计算机程序可操作来使计算机执行如权利要求1-9任一项所述的方法。
PCT/CN2021/073773 2020-12-31 2021-01-26 语音识别方法、装置及存储介质 Ceased WO2022141706A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
JP2023540012A JP7627766B2 (ja) 2020-12-31 2021-01-26 音声認識方法、装置及び記憶媒体
EP21912486.4A EP4273855B1 (en) 2020-12-31 2021-01-26 Speech recognition method and apparatus, and storage medium
KR1020237026093A KR20230159371A (ko) 2020-12-31 2021-01-26 음성 인식 방법 및 장치, 그리고 저장 매체

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011641751.3A CN112767917B (zh) 2020-12-31 2020-12-31 语音识别方法、装置及存储介质
CN202011641751.3 2020-12-31

Publications (1)

Publication Number Publication Date
WO2022141706A1 true WO2022141706A1 (zh) 2022-07-07

Family

ID=75698522

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/073773 Ceased WO2022141706A1 (zh) 2020-12-31 2021-01-26 语音识别方法、装置及存储介质

Country Status (5)

Country Link
EP (1) EP4273855B1 (zh)
JP (1) JP7627766B2 (zh)
KR (1) KR20230159371A (zh)
CN (1) CN112767917B (zh)
WO (1) WO2022141706A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117116264A (zh) * 2023-02-20 2023-11-24 荣耀终端有限公司 一种语音识别方法、电子设备以及介质

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113112995B (zh) * 2021-05-28 2022-08-05 思必驰科技股份有限公司 词声学特征系统、词声学特征系统的训练方法及系统
CN113436614B (zh) * 2021-07-02 2024-02-13 中国科学技术大学 语音识别方法、装置、设备、系统及存储介质
CN113488052B (zh) * 2021-07-22 2022-09-02 深圳鑫思威科技有限公司 无线语音传输和ai语音识别互操控方法
CN113782007B (zh) * 2021-09-07 2024-08-16 上海企创信息科技有限公司 一种语音识别方法、装置、语音识别设备及存储介质
CN114155849A (zh) * 2021-11-04 2022-03-08 北京搜狗科技发展有限公司 一种虚拟对象的处理方法、装置和介质
CN114360516B (zh) * 2021-12-10 2025-08-29 广州小鹏汽车科技有限公司 语音识别方法、服务器、语音识别系统和存储介质
CN114333791A (zh) * 2021-12-10 2022-04-12 广州小鹏汽车科技有限公司 语音识别方法、服务器、语音识别系统、可读存储介质
CN119724171B (zh) * 2024-10-24 2025-10-10 平安科技(深圳)有限公司 基于语音模型的词汇识别方法、装置、电子设备及介质

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10255922B1 (en) * 2013-07-18 2019-04-09 Google Llc Speaker identification using a text-independent model and a text-dependent model
CN110214351A (zh) * 2017-06-05 2019-09-06 谷歌有限责任公司 记录的媒体热词触发抑制
CN111583909A (zh) * 2020-05-18 2020-08-25 科大讯飞股份有限公司 一种语音识别方法、装置、设备及存储介质
CN111783466A (zh) * 2020-07-15 2020-10-16 电子科技大学 一种面向中文病历的命名实体识别方法

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11145293B2 (en) * 2018-07-20 2021-10-12 Google Llc Speech recognition with sequence-to-sequence models
US11295739B2 (en) 2018-08-23 2022-04-05 Google Llc Key phrase spotting
CN110162749B (zh) * 2018-10-22 2023-07-21 哈尔滨工业大学(深圳) 信息提取方法、装置、计算机设备及计算机可读存储介质
CN109829172B (zh) * 2019-01-04 2023-07-04 北京先声智能科技有限公司 一种基于神经翻译的双向解码自动语法改错模型
JP7234415B2 (ja) * 2019-05-06 2023-03-07 グーグル エルエルシー 音声認識のためのコンテキストバイアス
CN110648658B (zh) * 2019-09-06 2022-04-08 北京达佳互联信息技术有限公司 一种语音识别模型的生成方法、装置及电子设备
CN111009237B (zh) 2019-12-12 2022-07-01 北京达佳互联信息技术有限公司 语音识别方法、装置、电子设备及存储介质
CN111199727B (zh) * 2020-01-09 2022-12-06 厦门快商通科技股份有限公司 语音识别模型训练方法、系统、移动终端及存储介质
CN111933115B (zh) * 2020-10-12 2021-02-09 腾讯科技(深圳)有限公司 语音识别方法、装置、设备以及存储介质

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4273855A4

Also Published As

Publication number Publication date
EP4273855A4 (en) 2024-10-23
EP4273855B1 (en) 2026-03-25
EP4273855C0 (en) 2026-03-25
JP7627766B2 (ja) 2025-02-06
EP4273855A1 (en) 2023-11-08
KR20230159371A (ko) 2023-11-21
CN112767917B (zh) 2022-05-17
CN112767917A (zh) 2021-05-07
JP2024502048A (ja) 2024-01-17

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21912486

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2023540012

Country of ref document: JP

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2021912486

Country of ref document: EP

Effective date: 20230731

WWG Wipo information: grant in national office

Ref document number: 2021912486

Country of ref document: EP