WO2021135611A1 - Method, device, terminal, and storage medium for speech recognition
- Publication number
- WO2021135611A1 (PCT/CN2020/125608)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- speech
- signal
- speech recognition
- extended
- recognition model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
- G10L13/086—Detection of language
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/005—Language recognition
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/1815—Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/183—Speech classification or search using natural language modelling using context dependencies, e.g. language models
- G10L15/187—Phonemic context, e.g. pronunciation rules, phonotactical constraints or phoneme n-grams
Definitions
- This application belongs to the field of data processing technology, and in particular relates to a method, device, terminal, and storage medium for speech recognition.
- Existing speech recognition technology achieves high recognition accuracy for basic language types because a large number of samples is available, whereas for non-basic language types, such as dialects and minority languages, recognition accuracy is low because few samples are available. Existing speech recognition technology therefore has low recognition accuracy for non-basic languages, which limits its applicability.
- the embodiments of the present application provide a voice recognition method, device, terminal, and storage medium, which can solve the problems of low recognition accuracy and poor applicability of the existing voice recognition technology for non-basic languages.
- an embodiment of the present application provides a voice recognition method, including:
- The speech recognition model is obtained by training on a training sample set. The training sample set includes a plurality of extended speech signals, the extended text information corresponding to each extended speech signal, the original speech signal corresponding to each extended speech signal, and the original text information corresponding to each original speech signal. The extended speech signal is obtained by conversion based on existing text of the basic language type.
- Before inputting the target language signal into the speech recognition model corresponding to the target language type to obtain the text information output by the speech recognition model, the method further includes:
- a second native speech model is trained to obtain the real-time speech recognition model.
- Training the second native speech model according to the pronunciation probability matrix and the extended speech signal to obtain the real-time speech recognition model includes:
- fine-grained training is performed on the quasi-real-time speech model to obtain the real-time speech recognition model.
- Performing coarse-grained training on the second native voice model based on the pronunciation probability matrix and the extended voice text to obtain a quasi-real-time voice model includes:
- the loss function is specifically:
- In the loss function, Loss_top_k is the loss amount. Its remaining terms are: the probability value of the c-th pronunciation at the t-th frame of the extended speech signal in the prediction probability matrix; the probability value of the c-th pronunciation at the t-th frame of the extended speech signal in the pronunciation probability matrix after processing by the optimization algorithm; T, the total number of frames; C, the total number of pronunciations recognized in the t-th frame; the probability value of the c-th pronunciation at the t-th frame of the extended speech signal in the pronunciation probability matrix; the sequence number of the c-th pronunciation after all pronunciations of the t-th frame of the extended speech signal in the pronunciation probability matrix are sorted by probability value from large to small; and K, a preset parameter.
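The loss formula itself does not survive in this text; only the definitions of its terms do. As a hedged illustration of how those terms could fit together, the following sketch assumes a cross-entropy form: the pronunciation probability matrix is first "optimized" by keeping only the K highest-ranked pronunciations in each frame and zeroing the rest, and the loss is then the negative sum, over all frames and pronunciations, of the optimized probability multiplied by the logarithm of the predicted probability. The function names `top_k_filter` and `top_k_loss`, the `eps` constant, and the exact form of the loss are assumptions, not taken from the source.

```python
import numpy as np


def top_k_filter(pronunciation_probs: np.ndarray, k: int) -> np.ndarray:
    """Keep the k most probable pronunciations per frame; zero the others.

    pronunciation_probs has shape (T, C): T frames, C pronunciations.
    """
    # Sort each frame from large to small and record each entry's rank (1-based).
    order = np.argsort(-pronunciation_probs, axis=1)
    ranks = np.empty_like(order)
    rows = np.arange(pronunciation_probs.shape[0])[:, None]
    ranks[rows, order] = np.arange(1, pronunciation_probs.shape[1] + 1)
    # Assumed optimization step: entries ranked beyond K are discarded.
    return np.where(ranks <= k, pronunciation_probs, 0.0)


def top_k_loss(predicted_probs: np.ndarray,
               pronunciation_probs: np.ndarray,
               k: int,
               eps: float = 1e-12) -> float:
    """Assumed cross-entropy between the filtered matrix and the prediction."""
    filtered = top_k_filter(pronunciation_probs, k)
    return float(-np.sum(filtered * np.log(predicted_probs + eps)))
```

With K equal to C the filter keeps every entry, so the sketch reduces to an ordinary cross-entropy against the full pronunciation probability matrix; smaller K makes the model being trained learn only from the most confident pronunciations in each frame.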
- The number of first network layers in the asynchronous speech recognition model is greater than the number of second network layers in the real-time speech recognition model.
- Inputting the target language signal into the speech recognition model corresponding to the target language type to obtain the text information output by the speech recognition model includes:
- the speech spectrum corresponding to each of the audio frames is sequentially imported into the real-time speech recognition model, and the text information is output.
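The step above feeds one speech spectrum per audio frame into the real-time model. A minimal sketch of how a signal might be split into frames and each frame converted to a magnitude spectrum is shown below. The frame length, hop size, Hann window, and the function names `frame_signal` and `frame_spectra` are all assumptions chosen for illustration; the source does not specify them.

```python
import numpy as np


def frame_signal(signal: np.ndarray, frame_len: int, hop: int) -> np.ndarray:
    """Split a 1-D signal into overlapping frames of frame_len samples."""
    signal = np.asarray(signal, dtype=float)
    if len(signal) < frame_len:
        raise ValueError("signal is shorter than one frame")
    n_frames = 1 + (len(signal) - frame_len) // hop
    return np.stack([signal[i * hop: i * hop + frame_len]
                     for i in range(n_frames)])


def frame_spectra(signal: np.ndarray,
                  frame_len: int = 400,
                  hop: int = 160) -> np.ndarray:
    """Return one magnitude spectrum per frame, shape (n_frames, frame_len//2 + 1)."""
    frames = frame_signal(signal, frame_len, hop)
    window = np.hanning(frame_len)  # taper each frame to reduce spectral leakage
    return np.abs(np.fft.rfft(frames * window, axis=1))
```

The rows of the returned array can then be passed to a recognizer one at a time, in order, which matches the sequential import described above.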
- the method further includes :
- the target speech signal is imported into the training set corresponding to the target language type.
- an embodiment of the present application provides a voice recognition device, including:
- the target voice signal acquiring unit is used to acquire the target voice signal to be recognized
- the target language type recognition unit is used to determine the target language type of the target speech signal
- a speech recognition unit configured to input the target language signal into a speech recognition model corresponding to the target language type to obtain text information output by the speech recognition model;
- The speech recognition model is obtained by training on a training sample set. The training sample set includes a plurality of extended speech signals, the extended text information corresponding to each extended speech signal, the original speech signal corresponding to each extended speech signal, and the original text information corresponding to each original speech signal. The extended speech signal is obtained by conversion based on existing text of the basic language type.
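The three units of the device (signal acquisition, language type recognition, and recognition with the model for that language type) can be sketched as a small dispatcher. This is only an illustration under stated assumptions: the class name `SpeechRecognitionDevice`, the callable types, and the use of a dictionary keyed by language type are hypothetical and not taken from the source.

```python
from typing import Callable, Dict

# A recognizer maps a raw speech signal to text (hypothetical signature).
Recognizer = Callable[[bytes], str]


class SpeechRecognitionDevice:
    """Routes a target speech signal to the model for its language type."""

    def __init__(self,
                 identify_language: Callable[[bytes], str],
                 models: Dict[str, Recognizer]) -> None:
        self.identify_language = identify_language  # language type recognition unit
        self.models = models                        # one model per language type

    def recognize(self, target_signal: bytes) -> str:
        # Determine the target language type of the signal.
        language_type = self.identify_language(target_signal)
        model = self.models.get(language_type)
        if model is None:
            raise KeyError(f"no speech recognition model for {language_type!r}")
        # Input the signal into the matching model and return its text output.
        return model(target_signal)
```

Keeping one model per language type means a dialect model trained on extended speech signals can be added without touching the models for basic language types.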
- The embodiments of the present application provide a terminal device including a memory, a processor, and a computer program stored in the memory and runnable on the processor, wherein the processor, when executing the computer program, implements the speech recognition method of any one of the above-mentioned first aspect.
- An embodiment of the present application provides a computer-readable storage medium that stores a computer program, wherein, when the computer program is executed by a processor, the speech recognition method of any one of the above-mentioned first aspect is implemented.
- the embodiments of the present application provide a computer program product, which when the computer program product runs on a terminal device, causes the terminal device to execute the voice recognition method described in any one of the above-mentioned first aspects.
- Basic-language text with a large number of samples is converted into extended speech signals; the real-time speech recognition model corresponding to the target language type is trained with the original speech signals corresponding to the target language type together with the extended speech signals; and the trained real-time speech recognition model performs speech recognition on the target speech signal and outputs text information. This increases the number of samples available for training real-time speech recognition models for non-basic languages, thereby improving the accuracy and applicability of speech recognition.
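The sample-expansion idea can be sketched as follows. The sketch assumes two placeholder callables that the source does not name: `to_target_text`, which converts basic-language text into text of the target language type, and `synthesize`, which turns that text into a speech signal. The `TrainingSample` class and `build_training_set` function are likewise hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class TrainingSample:
    speech: bytes    # speech signal (original or extended)
    text: str        # text information paired with the signal
    extended: bool   # True if the sample was generated from basic-language text


def build_training_set(
    original_pairs: Iterable[Tuple[bytes, str]],
    base_texts: Iterable[str],
    to_target_text: Callable[[str], str],  # hypothetical text conversion
    synthesize: Callable[[str], bytes],    # hypothetical text-to-speech step
) -> List[TrainingSample]:
    """Combine original samples with samples expanded from basic-language text."""
    samples = [TrainingSample(speech, text, False)
               for speech, text in original_pairs]
    for base_text in base_texts:
        extended_text = to_target_text(base_text)
        samples.append(
            TrainingSample(synthesize(extended_text), extended_text, True))
    return samples
```

Because basic-language text is plentiful, the number of extended samples can far exceed the number of original recordings, which is the source of the accuracy gain described above.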
- FIG. 1 is a block diagram of a part of the structure of a mobile phone provided by an embodiment of the present application
- FIG. 2 is a schematic diagram of the software structure of a mobile phone according to an embodiment of the present application.
- FIG. 3 is an implementation flowchart of a voice recognition method provided by the first embodiment of the present application.
- FIG. 4 is a schematic structural diagram of a speech recognition system provided by an embodiment of the present application.
- FIG. 5 is an interaction flowchart of a voice recognition system provided by an embodiment of the present application.
- FIG. 6 is a specific implementation flowchart of a voice recognition method provided by the second embodiment of the present application.
- Fig. 7 is a schematic diagram of an extended speech-to-text conversion provided by an embodiment of the present application.
- FIG. 8 is a specific implementation flowchart of a voice recognition method provided by the third embodiment of the present application.
- Fig. 9 is a schematic structural diagram of an asynchronous speech recognition model and a real-time speech recognition model provided by an embodiment of the present application.
- FIG. 10 is a specific implementation flow chart of a method S803 for speech recognition provided by the fourth embodiment of the present application.
- FIG. 11 is a specific implementation flowchart of a method S1001 for speech recognition provided by the fifth embodiment of the present application.
- FIG. 12 is a schematic diagram of a training process of a real-time speech model provided by an embodiment of the present application.
- FIG. 13 is a specific implementation flowchart of a voice recognition method S303 provided by the sixth embodiment of the present application.
- FIG. 15 is a structural block diagram of a speech recognition device provided by an embodiment of the present application.
- FIG. 16 is a schematic diagram of a terminal device provided by another embodiment of the present application.
- The term “if” can be construed as “when”, “once”, “in response to determining”, or “in response to detecting”.
- The phrase “if determined” or “if [the described condition or event] is detected” can be interpreted, depending on the context, as meaning “once determined”, “in response to determining”, “once [the described condition or event] is detected”, or “in response to detecting [the described condition or event]”.
- The voice recognition method provided by the embodiments of this application can be applied to terminal devices such as mobile phones, tablet computers, wearable devices, vehicle-mounted devices, augmented reality (AR)/virtual reality (VR) devices, notebook computers, ultra-mobile personal computers (UMPC), netbooks, and personal digital assistants (PDA). It can also be applied to databases, servers, and service response systems based on terminal artificial intelligence to respond to voice recognition requests. The embodiments of this application do not impose any restriction on the specific type of terminal device.
- The terminal device may be a station (STATION, ST) in a WLAN, a cellular phone, a cordless phone, a Session Initiation Protocol (SIP) phone, a Wireless Local Loop (WLL) station, a Personal Digital Assistant (PDA) device, a handheld device with wireless communication capabilities, a computing device or other processing device connected to a wireless modem, a computer, a laptop, a handheld communication device, a handheld computing device, and/or another device used to communicate on a wireless system or a next-generation communication system, for example, a mobile terminal in a 5G network or a mobile terminal in a future evolved Public Land Mobile Network (PLMN).
- The term wearable device can also be a general term for devices developed by applying wearable technology to the intelligent design of daily wear, such as glasses, gloves, watches, clothing, and shoes.
- A wearable device is a portable device that is worn directly on the body or integrated into the user's clothes or accessories, and is attached to the user's body to collect the user's voice signal. A wearable device is not only a hardware device; it also realizes powerful functions through software support, data interaction, and cloud interaction.
- Wearable smart devices include full-featured, large-sized devices that can implement complete or partial functions without relying on a smartphone, such as smart watches or smart glasses, as well as devices that focus on only one type of application function and need to be used together with other devices such as smartphones, for example the various smart bracelets and smart jewelry used for physical sign monitoring.
- Fig. 1 shows a block diagram of a part of the structure of a mobile phone provided in an embodiment of the present application.
- the mobile phone includes: a radio frequency (RF) circuit 110, a memory 120, an input unit 130, a display unit 140, a sensor 150, an audio circuit 160, a near field communication module 170, a processor 180, and a power supply 190.
- FIG. 1 does not constitute a limitation on the mobile phone, and may include more or fewer components than those shown in the figure, or a combination of some components, or different component arrangements.
- the RF circuit 110 can be used for receiving and sending signals during information transmission or communication. In particular, after receiving the downlink information of the base station, it is processed by the processor 180; in addition, the designed uplink data is sent to the base station.
- the RF circuit includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier (LNA), a duplexer, and the like.
- the RF circuit 110 can also communicate with the network and other devices through wireless communication.
- The above-mentioned wireless communication can use any communication standard or protocol, including but not limited to Global System of Mobile Communication (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), email, and Short Messaging Service (SMS). Through the RF circuit 110, the mobile phone receives voice signals collected by other terminals, recognizes the voice signals, and outputs the corresponding text information.
- the memory 120 can be used to store software programs and modules.
- The processor 180 executes the various functional applications and data processing of the mobile phone by running the software programs and modules stored in the memory 120; for example, a trained real-time speech recognition algorithm is stored in the memory 120.
- the memory 120 may mainly include a program storage area and a data storage area.
- The program storage area may store an operating system, an application program required by at least one function (such as a sound playback function or an image playback function), and the like; the data storage area may store data created through use of the mobile phone (such as audio data and a phone book), and the like.
- the memory 120 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or other volatile solid-state storage devices.
- the input unit 130 may be used to receive inputted numeric or character information, and generate key signal input related to user settings and function control of the mobile phone 100.
- the input unit 130 may include a touch panel 131 and other input devices 132.
- The touch panel 131, also known as a touch screen, can collect the user's touch operations on or near it (for example, operations performed by the user on or near the touch panel 131 with a finger, a stylus, or any other suitable object or accessory), and drive the corresponding connection device according to a preset program.
- the display unit 140 may be used to display information input by the user or information provided to the user and various menus of the mobile phone, for example, output text information after voice recognition.
- the display unit 140 may include a display panel 141.
- the display panel 141 may be configured in the form of a liquid crystal display (LCD), an organic light-emitting diode (OLED), etc.
- The touch panel 131 can cover the display panel 141. When the touch panel 131 detects a touch operation on or near it, it transmits the operation to the processor 180 to determine the type of the touch event, and the processor 180 then provides the corresponding visual output on the display panel 141 according to the type of the touch event.
- Although the touch panel 131 and the display panel 141 are used as two independent components to realize the input and output functions of the mobile phone, in some embodiments the touch panel 131 and the display panel 141 can be integrated to realize the input and output functions of the mobile phone.
- the mobile phone 100 may also include at least one sensor 150, such as a light sensor, a motion sensor, and other sensors.
- the light sensor may include an ambient light sensor and a proximity sensor.
- the ambient light sensor can adjust the brightness of the display panel 141 according to the brightness of the ambient light.
- The proximity sensor can turn off the display panel 141 and/or the backlight when the mobile phone is moved to the ear.
- the accelerometer sensor can detect the magnitude of acceleration in various directions (usually three-axis), and can detect the magnitude and direction of gravity when it is stationary.
- the audio circuit 160, the speaker 161, and the microphone 162 can provide an audio interface between the user and the mobile phone.
- the audio circuit 160 can transmit the electrical signal converted from the received audio data to the speaker 161, which is converted into a sound signal for output by the speaker 161; on the other hand, the microphone 162 converts the collected sound signal into an electrical signal, which is then output by the audio circuit 160.
- the terminal device may collect the user's target voice signal through the microphone 162, and send the converted electrical signal to the processor of the terminal device for voice recognition.
- The terminal device can receive target voice signals sent by other devices through the near field communication module 170.
- The near field communication module 170 is integrated with a Bluetooth communication module, establishes a communication connection with the wearable device through the Bluetooth communication module, and receives the target voice signal fed back by the wearable device.
- Although FIG. 1 shows the near field communication module 170, it can be understood that it is not a necessary component of the mobile phone 100 and can be omitted as needed without changing the essence of the application.
- The processor 180 is the control center of the mobile phone. It uses various interfaces and lines to connect the various parts of the entire mobile phone, and it performs the various functions of the mobile phone and processes data by running or executing the software programs and/or modules stored in the memory 120 and calling the data stored in the memory 120, thereby monitoring the mobile phone as a whole.
- The processor 180 may include one or more processing units; preferably, the processor 180 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, user interface, application programs, and the like, and the modem processor mainly handles wireless communication. It can be understood that the foregoing modem processor may also not be integrated into the processor 180.
- the mobile phone 100 also includes a power source 190 (such as a battery) for supplying power to various components.
- the power source can be logically connected to the processor 180 through a power management system, so that functions such as charging, discharging, and power consumption management can be managed through the power management system.
- The Android system is divided into four layers: the application layer, the application framework layer (framework, FWK), the system layer, and the hardware abstraction layer. The layers communicate with one another through software interfaces.
- the application layer can be a series of application packages, which can include applications such as short message, calendar, camera, video, navigation, gallery, and call.
- the voice recognition algorithm can be embedded in the application program, the voice recognition process is started through the relevant controls in the application program, and the collected target voice signal is processed to obtain the corresponding text information.
- the application framework layer provides application programming interfaces (application programming interface, API) and programming frameworks for applications in the application layer.
- the application framework layer may include some predefined functions, such as functions for receiving events sent by the application framework layer.
- the application framework layer can include a window manager, a resource manager, and a notification manager.
- the window manager is used to manage window programs.
- the window manager can obtain the size of the display screen, determine whether there is a status bar, lock the screen, take a screenshot, etc.
- the content provider is used to store and retrieve data and make these data accessible to applications.
- the data may include videos, images, audios, phone calls made and received, browsing history and bookmarks, phone book, etc.
- the resource manager provides various resources for the application, such as localized strings, icons, pictures, layout files, video files, and so on.
- the notification manager enables the application to display notification information in the status bar, which can be used to convey notification-type messages, and it can automatically disappear after a short stay without user interaction.
- the notification manager is used to notify download completion, message reminders, and so on.
- the notification manager can also be a notification that appears in the status bar at the top of the system in the form of a chart or a scroll bar text, such as a notification of an application running in the background, or a notification that appears on the screen in the form of a dialog window. For example, text messages are prompted in the status bar, prompt sounds, electronic devices vibrate, and indicator lights flash.
- the application framework layer can also include:
- a view system which includes visual controls, such as controls that display text, controls that display pictures, and so on.
- the view system can be used to build applications.
- the display interface can be composed of one or more views.
- a display interface that includes a short message notification icon may include a view that displays text and a view that displays pictures.
- the phone manager is used to provide the communication function of the mobile phone 100. For example, the management of the call status (including connecting, hanging up, etc.).
- the system layer can include multiple functional modules. For example: sensor service module, physical state recognition module, 3D graphics processing library (for example: OpenGL ES), etc.
- the sensor service module is used to monitor the sensor data uploaded by various sensors at the hardware layer and determine the physical state of the mobile phone 100;
- The physical state recognition module is used to analyze and recognize user gestures, faces, and the like.
- the 3D graphics processing library is used to implement 3D graphics drawing, image rendering, synthesis, and layer processing.
- the system layer can also include:
- the surface manager is used to manage the display subsystem and provides a combination of 2D and 3D layers for multiple applications.
- the media library supports playback and recording of a variety of commonly used audio and video formats, as well as still image files.
- the media library can support multiple audio and video encoding formats, such as: MPEG4, H.264, MP3, AAC, AMR, JPG, PNG, etc.
- the hardware abstraction layer is the layer between hardware and software.
- the hardware abstraction layer may include display drivers, camera drivers, sensor drivers, microphone drivers, etc., used to drive related hardware at the hardware layer, such as display screens, cameras, sensors, and microphones.
- The microphone driver drives the microphone module to collect the user's target voice information, after which the subsequent voice recognition process is performed directly.
- The voice recognition method provided in the embodiments of the present application can be executed at any of the above-mentioned layers, which is not limited herein.
- the execution subject of the process is a device installed with a voice recognition program.
- the device of the voice recognition program may specifically be a terminal device.
- The terminal device may be a smart phone, a tablet computer, a notebook computer, a server, or the like used by the user; it recognizes the obtained voice signal and determines the text information corresponding to the voice signal, thereby converting the voice signal into text information.
- FIG. 3 shows an implementation flowchart of the voice recognition method provided by the first embodiment of the present application, and the details are as follows:
- a target voice signal to be recognized is acquired.
- the terminal device can collect the user's target voice signal through the built-in microphone module.
- The user can activate the microphone module by starting a specific application in the terminal device, such as a recording application or a real-time voice call application.
- the user can also click on some controls in the current application to activate the microphone module. For example, click on the voice-sending control in a social application to send the collected voice signal as interactive information to the communication peer.
- The terminal device collects, through the microphone module, the voice signal generated by the user during the click operation as the above-mentioned target voice signal. As another case, the terminal device has a built-in input method application that supports a voice input function; the user can activate the input method application of the terminal device by clicking an input control and then select the voice-input-to-text function.
- The terminal device can then start the microphone module, collect the user's target voice signal through the microphone module, convert the target voice signal into text information, and import the text information into the input control as the parameter to be input.
- the terminal device can also collect the user’s target voice signal through an external microphone module. In this case, the terminal device can establish a communication connection with the external microphone module through a wireless communication module or a serial interface.
- The user can click the recording button on the microphone module to start collecting the target voice signal, and the microphone module transmits the collected target voice signal to the terminal device through the communication connection established above. After the terminal device receives the target voice signal fed back by the microphone module, it can execute the subsequent speech recognition process.
- the terminal device may also acquire the target voice signal through the communication peer.
- the terminal device can establish a communication connection with the communication peer through the communication module, and receive the target voice signal sent by the communication peer through the communication connection.
- the method of collecting the target voice signal by the communication peer can refer to the above process, which will not be repeated here.
- After receiving the target voice signal fed back by the communication peer, the terminal device can perform voice recognition on the target voice signal.
- Terminal device A and terminal device B establish, based on a social application, a communication link for transmitting interactive data; terminal device B collects a target voice signal through its built-in microphone module and sends the target voice signal to terminal device A through that communication link.
- Terminal device A can play the aforementioned target voice signal through the speaker module, and the user of terminal device A can obtain the interactive content by listening. If the user of terminal device A cannot listen to the target voice signal, they can click the "text conversion" button to recognize the text information corresponding to the target voice signal, and the interactive content is displayed by outputting the text information.
- the terminal device may preprocess the target voice signal through a preset signal optimization algorithm, so as to improve the accuracy of subsequent voice recognition.
- the optimization method includes but is not limited to one or a combination of the following: signal amplification, signal filtering, anomaly detection, signal repair, etc.
- anomaly detection specifically refers to extracting multiple waveform feature parameters, such as the signal-to-noise ratio and the effective voice duration, from the signal waveform of the collected target voice signal, and calculating the signal quality of the target voice signal from the collected waveform feature values. If it is detected that the signal quality is lower than the effective signal threshold, the target voice signal is identified as an invalid signal, and subsequent speech recognition operations are not performed on the invalid signal. Conversely, if the signal quality is higher than the effective signal threshold, the target voice signal is recognized as a valid signal, and the operations of S302 and S303 are performed.
- the signal repair specifically refers to performing waveform fitting on the interrupted area in the process of collecting the target voice signal through a preset waveform fitting algorithm to generate a continuous target voice signal.
- the waveform fitting algorithm can be a neural network.
- the parameters in the waveform fitting algorithm are adjusted so that the waveform trend of the fitted target voice signal matches the waveform trend of the target user, thereby improving the waveform fitting effect.
- the signal repair operation is performed after the above-mentioned anomaly detection operation, because repairing the missing waveform of the target voice signal first would raise its apparent acquisition quality and interfere with anomaly detection, so that signals with poor acquisition quality could no longer be identified.
- the terminal device can first determine whether the target voice signal is a valid signal through the anomaly detection algorithm; if the target voice signal is a valid signal, the target voice signal is repaired through the signal repair algorithm; otherwise, if the target voice signal is an abnormal signal, there is no need to perform signal repair, thereby reducing unnecessary repair operations.
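The detect-then-repair order described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the quality score (a mean of two clipped ratios), the reference values, the threshold of 0.5, and the linear-interpolation repair standing in for the waveform fitting algorithm are all assumptions introduced here. A signal exactly at the threshold is treated as valid, a case the text leaves open.

```python
from dataclasses import dataclass


@dataclass
class WaveformFeatures:
    snr_db: float             # signal-to-noise ratio, in dB
    effective_seconds: float  # effective voice duration, in seconds


def signal_quality(f, snr_ref_db=30.0, dur_ref_s=2.0):
    """Hypothetical quality score in [0, 1]: mean of two clipped ratios."""
    snr_term = min(max(f.snr_db / snr_ref_db, 0.0), 1.0)
    dur_term = min(max(f.effective_seconds / dur_ref_s, 0.0), 1.0)
    return 0.5 * (snr_term + dur_term)


def linear_repair(samples):
    """Fill interrupted areas (None entries) by linear interpolation.

    Stand-in for the waveform fitting algorithm, which the text says
    may be a neural network.
    """
    out = list(samples)
    n = len(out)
    i = 0
    while i < n:
        if out[i] is None:
            j = i
            while j < n and out[j] is None:
                j += 1
            # Neighbouring known samples; fall back to a constant at the edges.
            left = out[i - 1] if i > 0 else (out[j] if j < n else 0.0)
            right = out[j] if j < n else left
            span = j - i + 1
            for k in range(i, j):
                out[k] = left + (right - left) * (k - i + 1) / span
            i = j
        else:
            i += 1
    return out


def preprocess(samples, features, threshold=0.5, repair=linear_repair):
    """Run anomaly detection first; repair only signals judged valid."""
    if signal_quality(features) < threshold:
        return None  # invalid signal: skip repair and recognition
    return repair(samples)
```

Because the quality check runs on the unrepaired signal, a badly interrupted recording is rejected before the repair step can mask its defects.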
- the terminal device can extract the effective voice segments in the target voice signal through the voice activity detection algorithm, where an effective voice segment specifically refers to a voice segment containing speech content, and an invalid voice segment specifically refers to a voice segment that does not contain speech content.
- the terminal device can set the voice start amplitude and the voice end amplitude, where the value of the voice start amplitude is greater than the value of the voice end amplitude. That is, the start requirement of an effective voice segment is higher than its end requirement.
- the terminal device can perform effective voice detection on the voice waveform according to the voice start amplitude and the voice end amplitude, thereby dividing the waveform into multiple effective voice segments, where the amplitude at the start moment of each effective voice segment is greater than or equal to the voice start amplitude, and the amplitude at the end moment is less than or equal to the voice end amplitude.
- the terminal device can perform voice recognition on the effective voice segment, while the invalid voice segment does not need to be recognized, so that the signal length of the voice recognition can be reduced, thereby improving the recognition efficiency.
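The two-threshold segmentation described above can be sketched as a small state machine. This is an illustrative sketch only: the function name, the use of a per-frame amplitude list as input, and the handling of a segment still open at the end of the signal are assumptions.

```python
def effective_segments(amplitudes, start_amp, end_amp):
    """Return (start, end) index pairs of effective voice segments.

    A segment opens at the first index whose amplitude is >= start_amp
    and closes at the next index whose amplitude is <= end_amp
    (end index inclusive).
    """
    if start_amp <= end_amp:
        raise ValueError("voice start amplitude must exceed voice end amplitude")
    segments = []
    start = None
    for i, amp in enumerate(amplitudes):
        if start is None:
            if amp >= start_amp:
                start = i
        elif amp <= end_amp:
            segments.append((start, i))
            start = None
    if start is not None:
        # Assumption: a segment still open at the end runs to the last index.
        segments.append((start, len(amplitudes) - 1))
    return segments


amps = [0.1, 0.6, 0.4, 0.3, 0.1, 0.05, 0.7, 0.5]
print(effective_segments(amps, start_amp=0.5, end_amp=0.2))
```

Setting the start amplitude above the end amplitude keeps a segment open through brief dips in loudness, so a single utterance is not split at every quiet syllable.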
- the target voice signal may specifically be an audio stream containing multiple voice frames, wherein the sampling rate of the audio stream is specifically 16 kHz, that is, 16k signal points are collected per second, and each signal point is represented by 16 bits, that is, the bit depth is 16 bits.
- the frame length of each speech frame is 25ms, and the interval between each speech frame is 10ms.
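As a worked example of these figures, the sketch below converts the frame length and frame interval into sample counts and splits a signal into frames. It assumes that the 10 ms "interval" is the shift between the starts of consecutive frames, so that adjacent frames overlap; the function names are hypothetical.

```python
SAMPLE_RATE_HZ = 16_000  # 16 kHz sampling rate
FRAME_MS = 25            # frame length
HOP_MS = 10              # assumed: shift between consecutive frame starts


def frame_params(sample_rate_hz=SAMPLE_RATE_HZ, frame_ms=FRAME_MS, hop_ms=HOP_MS):
    """Convert frame length and frame shift from milliseconds to samples."""
    frame_len = sample_rate_hz * frame_ms // 1000
    hop_len = sample_rate_hz * hop_ms // 1000
    return frame_len, hop_len


def split_frames(samples, frame_len, hop_len):
    """Split samples into overlapping frames, dropping a trailing partial frame."""
    last_start = len(samples) - frame_len
    return [samples[s:s + frame_len] for s in range(0, last_start + 1, hop_len)]


frame_len, hop_len = frame_params()
one_second = list(range(SAMPLE_RATE_HZ))
frames = split_frames(one_second, frame_len, hop_len)
print(frame_len, hop_len, len(frames))
```

Under these assumptions a frame holds 400 samples, consecutive frames start 160 samples apart, and one second of audio yields 98 complete frames.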
- the target language type of the target voice signal is determined.
- the terminal device can determine the target language type corresponding to the target voice signal through a preset language recognition algorithm. Since the target voice signal may be spoken in different language types, and different language types correspond to different voice recognition algorithms, it is necessary to determine the target language type corresponding to the target voice signal before performing voice recognition.
- the target language type can be divided based on languages, such as Chinese, English, Russian, German, French, and Japanese, and can also be divided based on regional dialects: Chinese can be divided into Mandarin, Cantonese, Shanghai dialect, Sichuan dialect, etc., and Japanese can be divided into Kansai dialect and standard Japanese.
- the terminal device may receive the geographic range input by the user, such as the Asian range, the China range, or the Guangdong range.
- the terminal device may determine the language types contained in the region based on the geographic range input by the user, and adjust the language recognition algorithm based on all language types in the region.
- the geographical scope is the Guangdong scope
- the language types included in the Guangdong scope are: Cantonese, Chaoshan dialect, Hakka, and Mandarin. Based on the above four language types, the corresponding language recognition algorithm is configured.
- the terminal device can also obtain the location information when the terminal device collects the target voice signal through the built-in positioning device, and determine the geographic range based on the location information, thereby eliminating the need for manual input by the user to improve the degree of automation.
- the terminal device can filter out language types with a low recognition probability based on the above-mentioned geographical range, thereby improving the accuracy of the language recognition algorithm.
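The geographic filtering above can be illustrated with a short sketch. The region table reuses the Guangdong example from the text; the score dictionary, the function name, and the fallback to unfiltered scores when no candidate survives are assumptions made for illustration.

```python
# Region-to-language table; the Guangdong entry follows the example in the text.
REGION_LANGUAGES = {
    "Guangdong": ["Cantonese", "Chaoshan dialect", "Hakka", "Mandarin"],
}


def pick_language(scores, region=None, region_languages=REGION_LANGUAGES):
    """Pick the most probable language type, restricted to the region if known.

    `scores` maps language type -> probability from a (hypothetical)
    language recognition algorithm.
    """
    candidates = region_languages.get(region)
    if candidates:
        filtered = {lang: s for lang, s in scores.items() if lang in candidates}
        if filtered:  # keep the unfiltered scores if nothing survives
            scores = filtered
    return max(scores, key=scores.get)


scores = {"Shanghai dialect": 0.45, "Cantonese": 0.40, "Mandarin": 0.15}
print(pick_language(scores))               # no region information
print(pick_language(scores, "Guangdong"))  # restricted to Guangdong languages
```

In this example the regional restriction removes the improbable Shanghai dialect candidate, so the decision falls to the best-scoring language actually spoken in the region.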
- the terminal device may specifically be a voice recognition server.
- the voice recognition server can receive the target voice signal sent by each user terminal, determine the target language type of the target voice signal through the built-in language recognition algorithm, extract the real-time voice recognition model corresponding to the target language type from the database to recognize the text information corresponding to the target voice signal, and feed back the text information to the user terminal.
- FIG. 4 shows a schematic structural diagram of a speech recognition system provided by an embodiment of the present application.
- the voice recognition system includes a user terminal 41 and a voice recognition server 42.
- the user can collect the target voice signal that needs to be recognized through the user terminal 41.
- the user terminal 41 can be installed with a client program corresponding to the voice recognition server 42, establish a communication connection with the voice recognition server 42 through the client program, and send the collected target voice signal to the voice recognition server 42 through the client program. Since the voice recognition server 42 adopts a real-time voice recognition model, it can respond to the user's voice recognition request in real time and feed back the voice recognition result to the user terminal 41 through the client program.
- the user terminal 41 may output the text information in the voice recognition result to the user through an interactive module, such as a display or a touch screen, to complete the voice recognition process.
- the terminal device can call the application program interface (API) provided by the speech recognition server to send the target voice signal to be recognized to the speech recognition server; the speech recognition server determines the target language type of the target voice signal through its built-in language recognition algorithm, selects the voice recognition algorithm corresponding to the target language type, outputs the text information of the target voice signal, and feeds back the text information to the terminal device through the API.
- the speech recognition model is obtained by training through a training sample set, and the training sample set includes a plurality of extended speech signals, the extended text information corresponding to each extended speech signal, the original speech signal corresponding to each extended speech signal, and The original text information corresponding to each original speech signal, and the extended speech signal is obtained by conversion based on the existing text of the basic language type.
- after the terminal device determines the target language type corresponding to the target voice signal, it can obtain a real-time speech recognition model corresponding to the target language type.
- the built-in memory of the terminal device can store real-time speech recognition models for various language types.
- the terminal device can select the corresponding real-time voice recognition model from the memory according to the type number of the target language type; the terminal device can also send a model acquisition request to the cloud server, where the model acquisition request carries the type number of the recognized target language type.
- the cloud server can feed back the real-time speech recognition model corresponding to the type number to the terminal device.
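The two ways of obtaining a model, local memory first and the cloud server otherwise, can be sketched as below. The class name, the callback standing in for the model acquisition request, and the caching of downloaded models are assumptions; the text does not say whether a downloaded model is kept locally.

```python
class ModelStore:
    """Look up a real-time recognition model by language type number."""

    def __init__(self, local_models, fetch_from_cloud):
        self._local = dict(local_models)   # type number -> model
        self._fetch = fetch_from_cloud     # hypothetical cloud request

    def get(self, type_number):
        if type_number not in self._local:
            # Assumption: cache the model returned by the cloud server.
            self._local[type_number] = self._fetch(type_number)
        return self._local[type_number]


requests = []

def fake_cloud(type_number):
    """Stand-in for the cloud server; records each request it receives."""
    requests.append(type_number)
    return f"model-{type_number}"

store = ModelStore({1: "model-1"}, fake_cloud)
print(store.get(1), store.get(2), store.get(2), requests)
```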
- the basic language type is Putonghua. Because it has more users and is used on more occasions, the number of speech samples that can be collected is large. When training a real-time speech recognition model, the large number of samples gives a good training effect, which in turn makes the output accuracy of the real-time speech recognition model of the basic language type higher.
- non-basic language types such as local dialects
- Chinese local dialects are other languages that are different from Mandarin, such as Cantonese, Chaoshan dialect, Shanghai dialect, Beijing dialect, Tianjin dialect, etc.
- the training set used to train the real-time speech recognition model in this embodiment of the application contains extended voice signals in addition to the original voice signals.
- the original voice signal is a signal whose speaker uses the target language type, that is, a speech signal spoken in the target language type.
- the extended voice signal is not an original signal that is actually collected; instead, the basic language text corresponding to the basic language type is imported into a preset voice synthesis algorithm, and the synthesized voice signal is output. Since the number of basic language texts edited in the basic language type is large, the number of samples is large, which can improve the coverage of training. For example, most Chinese books, notices, and Internet articles are written with Putonghua as the reading language, while the amount of text written with local dialects such as Cantonese or Northeastern dialects as the reading language is relatively small. Therefore, the basic language text corresponding to the basic language type can be converted into extended speech signals to expand the number of samples for non-basic language types.
- the way to obtain the original voice signal may be: the terminal device downloads a corpus of the target language type from multiple preset cloud servers, and the corpus stores multiple historical voice signals of the target language type.
- the terminal equipment sorts all historical voice signals, and uses the sorted historical voice signals as the original voice signals in the training set.
- the aforementioned historical voice signal can be clipped from the audio data of a video file.
- the tag of a movie file contains the dubbing language. If the dubbing language matches the target language type, the audio data in the movie file is recorded as voice signals of the target language type, so the above-mentioned original voice signal can be obtained from the audio data in the movie file.
- the original voice signal can also be extracted from the existing files.
- the method of generating the extended voice signal may be: the terminal device performs semantic analysis on the existing text of the basic language type through the semantic recognition algorithm, determines the text keywords contained in the existing text, determines the keyword translated name corresponding to each text keyword in the target language type, obtains the translated name pronunciation corresponding to each keyword translated name, and generates the above-mentioned extended text based on the translated name pronunciations of all keyword translated names.
- FIG. 5 shows an interaction flowchart of a voice recognition system provided by an embodiment of the present application.
- the voice recognition system includes a user terminal and a voice recognition server.
- the speech recognition server includes a number of different modules, namely a language type recognition module and a real-time speech recognition module corresponding to different language types.
- the real-time speech recognition modules include a real-time speech recognition module for the basic language type and real-time speech recognition modules for local dialects.
- the language type recognition module in the voice recognition server determines the target language type of the target voice signal and transmits the signal to the corresponding real-time voice recognition module for voice recognition, which outputs the corresponding text information and feeds back the output text information to the user terminal.
- the terminal device can train the native voice recognition model through the original voice signal and the extended voice signal obtained by converting the existing text of the basic language type.
- when the recognition result of the native voice recognition model converges and the corresponding loss function is less than the preset loss threshold, it is determined that the adjustment of the native speech recognition model has been completed.
- the adjusted native speech recognition model can be used as the above-mentioned real-time speech recognition model in response to the initiated speech recognition operation.
- ASR: Automatic Speech Recognition.
- the ASR model can also collect a large amount of user data during use. If these data can be annotated in an automated way, the training corpus can be expanded on a large scale, thereby improving speech recognition accuracy.
- the ASR model is required to adapt to different language types through self-learning, so as to achieve high recognition accuracy for all language types.
- the training corpus of some dialects is insufficient, which affects the recognition rate of such dialects.
- the number of samples for various dialects is seriously unbalanced.
- different real-time speech recognition models can be configured according to the geographical information of where the voice signal was collected, so that real-time speech recognition models can be trained based on administrative region division rules, such as provinces or urban areas, to achieve targeted model training.
- the above method distinguishes accents by province and cannot achieve fine-grained accent modeling. On the one hand, dialects within some provinces differ greatly: dialects in the same province can have completely different pronunciations and even phrases, so the consistency of accents within a province cannot be guaranteed, which makes the training granularity of the real-time speech recognition model coarse and reduces the accuracy of recognition. On the other hand, some dialects, such as Cantonese and Shanghainese, have many speakers, and those speakers may be distributed across many different provinces, so the above method cannot be optimized for a specific dialect, which also reduces the accuracy of recognition.
- the method provided in this embodiment can take advantage of the large number of samples and high coverage of the basic language type to convert the existing text of the basic language type into an extended speech signal of the target language type.
- the conversion method is directional conversion, so the generated extended language signal must be based on the target language type speech signal, which eliminates the need for users to manually mark, reduces labor costs, and can also provide a large amount of training corpus for local dialects.
- the sample balance of different language types is realized, and the accuracy of the training operation is improved.
- the method for speech recognition converts basic language text, which has a large number of samples, into extended speech signals, trains the real-time speech recognition model corresponding to the target language type using the original speech signals of the target language type together with the extended speech signals, recognizes the target speech signal through the trained real-time speech recognition model, and outputs text information. This increases the number of samples available for training real-time speech recognition models of non-basic languages, thereby improving the accuracy and applicability of speech recognition.
- FIG. 6 shows a specific implementation flowchart of a voice recognition method provided by the second embodiment of the present application.
- before the target voice signal is input into the speech recognition model corresponding to the target language type to obtain the text information output by the speech recognition model, the method also includes S601 to S603, which are detailed as follows:
- before the inputting of the target voice signal into the speech recognition model corresponding to the target language type to obtain the text information output by the speech recognition model, the method further includes:
- the terminal device can extract the existing text of the basic language type from a text library in a cloud database, and can also crawl data from the Internet to obtain texts recorded in the basic language type, so as to obtain the above-mentioned existing text.
- when responding to a voice recognition operation initiated by the user, the terminal device acquires the historical voice signal sent by the user. If it detects that the language type corresponding to the historical voice signal is the basic language type, it can use the historical text generated from the historical voice signal as existing text recorded in the basic language type, thereby achieving the purpose of self-collecting training data, increasing the number of training samples, and in turn improving the recognition accuracy of the real-time voice recognition model.
- different target language types correspond to different basic language types
- the terminal device can establish a basic language correspondence to determine the basic language types associated with different target language types.
- one target language type corresponds to one basic language type
- one basic language type can correspond to multiple target language types.
- for all language types belonging to the Chinese language category, the corresponding basic language type is Mandarin.
- for all language types belonging to the English language category, the corresponding basic language type is British English, so that the correspondence between different language types and basic language types can be determined.
- the terminal device can determine the basic language type corresponding to the target language type according to the basic language correspondence established above, and obtain the existing text of the basic language type.
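The basic language correspondence can be represented as two lookup tables, as in the sketch below. The table contents beyond the Mandarin and British English examples given in the text (in particular the "American English" entry) and all function names are hypothetical.

```python
# Target language type -> language category (entries are illustrative).
LANGUAGE_CATEGORY = {
    "Mandarin": "Chinese",
    "Cantonese": "Chinese",
    "Sichuan dialect": "Chinese",
    "British English": "English",
    "American English": "English",  # hypothetical entry
}

# Language category -> basic language type, as in the examples in the text.
CATEGORY_BASE = {
    "Chinese": "Mandarin",
    "English": "British English",
}


def base_language(target_language):
    """Each target language type maps to exactly one basic language type."""
    return CATEGORY_BASE[LANGUAGE_CATEGORY[target_language]]


def targets_of(base):
    """One basic language type can correspond to several target language types."""
    return sorted(t for t in LANGUAGE_CATEGORY if base_language(t) == base)


print(base_language("Cantonese"))
print(targets_of("Mandarin"))
```

Keeping the mapping many-to-one means that one large pool of existing text in the basic language type can feed sample expansion for every dialect in its category.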
- the existing text is converted into an extended speech text corresponding to the target language type.
- the terminal device can determine the translation algorithm between the basic language type and the target language type, and import the existing text into the translation algorithm to generate the extended speech text. Since the existing text is recorded in the basic language type, its vocabulary and grammar follow the basic language type, and the vocabulary and grammar used by different language types differ.
- to improve the accuracy of the subsequent extended speech signal, the terminal device does not directly generate the corresponding synthesized speech from the existing text, but first translates the existing text, so as to generate an extended speech text that conforms to the grammatical structure and wording conventions of the target language type, thereby improving the subsequent recognition accuracy.
- the terminal device can verify the correctness of the translation after the extended voice text is converted.
- the terminal device can determine each entity contained in the existing text through the semantic analysis algorithm, and obtain the corresponding translated name of each entity in the target language type; it then detects whether each translated name appears in the converted extended speech text. If every translated name appears in the extended speech text, the terminal device identifies the mutual positional relationship between the translated names and determines, based on that relationship, whether the translated names conform to the grammatical structure of the target language type. If the mutual positional relationship satisfies the grammatical structure, the translation is identified as correct; conversely, if the mutual positional relationship does not satisfy the grammatical structure and/or a translated name is not included in the extended speech text, the translation is identified as failed and the translation algorithm needs to be readjusted.
- the terminal device can obtain the standard pronunciation corresponding to each character in the extended speech text through the speech synthesis algorithm, determine the phrases contained in the extended speech text through the semantic recognition algorithm, and determine the inter-word interval time between phrases and the inter-character interval time between different characters within a phrase. According to the inter-word interval time, the inter-character interval time, and the standard pronunciation corresponding to each character, the extended speech signal corresponding to the extended speech text is generated, that is, an extended speech signal whose spoken language is the target language type.
- the terminal device can establish corresponding corpora for different target language types. Each corpus records multiple basic pronunciations of the target language type. After obtaining a character of the target language type, the terminal device can determine the basic pronunciations contained in the character, and merge and transform the multiple basic pronunciations to obtain the standard pronunciation corresponding to the character, so that the extended speech signal can be generated based on the standard pronunciation corresponding to each character.
- FIG. 7 shows a schematic diagram of the extended speech-to-text conversion provided by an embodiment of the present application.
- An existing text obtained by the terminal device is "I don't have what you want here", and its corresponding basic language type is Mandarin, and the target language type is Cantonese.
- the terminal device can use the translation algorithm between Mandarin and Cantonese to translate the above-mentioned existing text into an extended speech text based on Cantonese, and the translation result is "I don’t want it"; the above-mentioned extended speech text is then imported into the Cantonese speech synthesis algorithm to obtain the corresponding extended speech signal.
- An expanded speech signal used to indicate "I don't have what you want here” is obtained, which achieves the purpose of sample expansion.
- FIG. 8 shows a specific implementation flowchart of a voice recognition method provided by the third embodiment of the present application.
- before the target voice signal is input into the speech recognition model corresponding to the target language type to obtain the text information output by the speech recognition model, the method also includes S801 to S803, which are detailed as follows:
- before the inputting of the target voice signal into the speech recognition model corresponding to the target language type to obtain the text information output by the speech recognition model, the method further includes:
- the first native voice model is trained through the original voice signal in the training set and the original language text corresponding to the original voice signal to obtain an asynchronous voice recognition model.
- the terminal device may be configured with two different speech recognition models, which are a real-time speech recognition model capable of responding to real-time speech recognition operations and an asynchronous speech recognition model that requires a longer response time.
- the real-time speech recognition model can be built based on a neural network.
- the neural network that builds the above-mentioned real-time speech recognition model has fewer network levels, so it has a faster response efficiency, but at the same time, the recognition accuracy is lower than that of the asynchronous speech recognition model.
- the asynchronous speech recognition model can also be built based on a neural network.
- the neural network that builds the above asynchronous speech recognition model has more network levels, which requires a longer time for recognition and gives lower response efficiency, but at the same time, its recognition accuracy is higher than that of the real-time speech recognition model.
- the asynchronous speech recognition model is used to correct the data in the training process of the real-time speech recognition model, thereby improving the accuracy of the real-time speech recognition model.
- the real-time speech recognition model and the asynchronous speech recognition model may be built on neural networks of the same structure or on neural networks of different structures, which is not limited here. Accordingly, the second native voice model used to build the real-time voice recognition model and the first native voice model used to build the asynchronous voice recognition model may likewise be built on neural networks of the same structure or of different structures, which is not limited here.
- the original voice signal is a voice signal that has not been converted.
- the pronunciation of each syllable in the original voice signal varies from user to user. Therefore, the original voice signal gives a higher coverage rate for the test process, and since a user's pronunciation will deviate from the standard pronunciation, such deviations can also be identified and corrected in the subsequent training process.
- the terminal device can use the original voice signal and the original language text corresponding to the original voice signal as training samples to train the first native voice model. The network parameters at the point when the training result converges and the loss of the model is less than the preset loss threshold are used as the trained network parameters, and the first native speech model is configured based on the trained network parameters to obtain the aforementioned asynchronous speech recognition model.
- the function for calculating the loss used in the first native voice model may be the Connectionist Temporal Classification loss function (CTC Loss), and the CTC Loss can be specifically expressed as:
  $$Loss_{ctc} = -\sum_{(x,z)\in S} \ln p(z \mid x)$$
- $Loss_{ctc}$ is the aforementioned loss function
- $x$ is the original speech signal
- $z$ is the original language text corresponding to the original speech signal
- $S$ is the training set composed of all the original speech signals
- $p(z \mid x)$ is the probability value of outputting the original language text based on the original speech signal.
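To make the CTC loss concrete, the sketch below computes the probability of a label sequence with the standard CTC forward recursion and sums the negative log-probabilities over a training set. It is a minimal pure-Python illustration under stated assumptions: per-frame probabilities are given as plain lists, index 0 is the blank symbol, and no log-space arithmetic is used, so it is only suitable for short sequences and requires a non-zero probability for each target.

```python
import math


def ctc_probability(frame_probs, labels, blank=0):
    """Probability of `labels` given per-frame symbol probabilities.

    Sums over all frame-level alignments that collapse to `labels`
    (merge repeats, then drop blanks) using the CTC forward recursion.
    frame_probs[t][k] is the probability of symbol k at frame t.
    """
    # Extended sequence: blank, l1, blank, l2, ..., blank
    ext = [blank]
    for label in labels:
        ext += [label, blank]
    size = len(ext)

    alpha = [0.0] * size
    alpha[0] = frame_probs[0][blank]
    if size > 1:
        alpha[1] = frame_probs[0][ext[1]]

    for t in range(1, len(frame_probs)):
        prev = alpha
        alpha = [0.0] * size
        for s in range(size):
            total = prev[s]
            if s >= 1:
                total += prev[s - 1]
            # Skipping a blank is allowed only between different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += prev[s - 2]
            alpha[s] = total * frame_probs[t][ext[s]]

    return alpha[-1] + (alpha[-2] if size > 1 else 0.0)


def ctc_loss(dataset, blank=0):
    """Negative log-likelihood summed over (frame_probs, labels) pairs."""
    return -sum(math.log(ctc_probability(p, z, blank)) for p, z in dataset)


# Two frames, two symbols (0 = blank, 1 = a label), uniform probabilities.
probs = [[0.5, 0.5], [0.5, 0.5]]
print(ctc_probability(probs, [1]), ctc_loss([(probs, [1])]))
```

In the two-frame example, three alignments collapse to the single label (label-label, label-blank, blank-label), each with probability 0.25, which is why the sequence probability exceeds that of any single alignment.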
- the first network level in the asynchronous speech recognition model is more than the second network level in the real-time speech recognition model.
- the above two speech recognition models are specifically speech recognition models built based on neural networks of the same structure, and the asynchronous speech recognition model contains more first network levels than the second network levels of the real-time speech recognition model, so the asynchronous speech recognition model has better recognition accuracy, but its speech recognition operation takes a longer time, so it is suitable for non-real-time asynchronous response scenarios.
- different users can send audio files that need to perform voice recognition to the terminal device, and the terminal device can import the above audio files into the asynchronous voice recognition model.
- the user terminal and the terminal device can configure the communication link as a long-lived connection, and the operation of the asynchronous speech recognition model is checked at a preset time interval.
- the asynchronous voice recognition model can add each speech recognition task to a preset task list, and process them in sequence based on the addition order of each speech recognition task, and send each speech recognition result to each user terminal.
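The task list described above behaves like a first-in, first-out queue. The sketch below illustrates that ordering only; the class name, the `recognize` callback standing in for the asynchronous speech recognition model, and returning results as a list instead of sending them to user terminals are assumptions.

```python
from collections import deque


class AsyncRecognitionQueue:
    """Process speech recognition tasks in the order they were added."""

    def __init__(self, recognize):
        self._recognize = recognize  # hypothetical recognition callback
        self._tasks = deque()

    def submit(self, user_id, audio):
        """Add a task to the end of the task list."""
        self._tasks.append((user_id, audio))

    def run(self):
        """Drain the task list in submission order; return (user, text) pairs."""
        results = []
        while self._tasks:
            user_id, audio = self._tasks.popleft()
            results.append((user_id, self._recognize(audio)))
        return results


queue = AsyncRecognitionQueue(recognize=lambda audio: f"text for {audio}")
queue.submit("user-a", "file-1.wav")
queue.submit("user-b", "file-2.wav")
print(queue.run())
```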
- the real-time voice recognition model can respond to the voice recognition request sent by the user in real time.
- a real-time transmission link can be established between the user terminal and the terminal device.
- the voice signal corresponds to the audio stream.
- the terminal device imports the audio stream into the real-time voice recognition model, that is, while the user terminal collects the user’s voice signal, the real-time voice recognition model can perform voice recognition on the audio frames that have been fed back in the voice signal.
- the user terminal can send the complete audio stream to the terminal device, and the terminal device transmits the remaining unrecognized audio frames that are subsequently received to the real-time voice recognition model to generate the voice recognition result, that is, the text information, and feeds it back to the user terminal, which achieves the purpose of responding to the voice recognition request initiated by the user in real time.
- FIG. 9 shows a schematic structural diagram of an asynchronous speech recognition model and a real-time speech recognition model provided by an embodiment of the present application.
- the real-time speech recognition model and the asynchronous speech recognition model belong to a neural network with the same network structure, including a frequency feature extraction layer, a convolutional layer CNN, a recurrent neural network layer Bi-RNN, and a fully connected layer.
- the number of layers of the frequency feature extraction layer and the fully connected layer in the real-time speech recognition model and the asynchronous speech recognition model are the same, and both are one layer.
- the frequency feature extraction layer can extract the spectral feature values of the speech spectrum obtained by converting the audio stream, to obtain a frequency feature matrix; the fully connected layer can determine multiple pronunciation probabilities for each audio frame from the feature vector finally output by the preceding layers, generate the pronunciation probability matrix, and output the text information corresponding to the voice signal based on the pronunciation probability matrix.
- the real-time speech recognition model includes two convolutional layers and four recurrent neural network layers; the asynchronous speech recognition model includes three convolutional layers and nine recurrent neural network layers. More convolutional layers and recurrent neural network layers give better feature extraction and thereby improve recognition accuracy, but the more network layers there are, the longer the calculation time required. Therefore, the real-time speech recognition model needs to balance recognition accuracy and response time, and its number of configured network levels is smaller than that of the asynchronous speech recognition model.
- the recognition accuracy of the asynchronous speech recognition model can be improved, so that the subsequent training process of the real-time speech recognition model can be monitored and corrected, thereby improving Recognition accuracy of real-time speech recognition model.
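- The layer counts described above can be written down as a small configuration sketch. The class name `ModelConfig`, its field names, and the `depth` helper are illustrative assumptions, not part of the application; only the layer counts come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    """Layer counts for one recognition model (field names are hypothetical)."""
    feature_layers: int  # frequency feature extraction layers
    conv_layers: int     # convolutional (CNN) layers
    rnn_layers: int      # recurrent (Bi-RNN) layers
    fc_layers: int       # fully connected layers

    def depth(self) -> int:
        """Total number of configured layers."""
        return (self.feature_layers + self.conv_layers
                + self.rnn_layers + self.fc_layers)


# Counts taken from the description: 1 + 2 + 4 + 1 and 1 + 3 + 9 + 1.
REAL_TIME = ModelConfig(feature_layers=1, conv_layers=2, rnn_layers=4, fc_layers=1)
ASYNC = ModelConfig(feature_layers=1, conv_layers=3, rnn_layers=9, fc_layers=1)
```

- Both configurations share the same layer types and differ only in the number of convolutional and recurrent layers, which is where the trade-off between accuracy and response time is made.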
- Each extended speech signal can be imported into the aforementioned asynchronous speech recognition model to generate the pronunciation probability matrix corresponding to that extended speech signal. An extended speech signal is composed of speech frames, and each speech frame corresponds to one pronunciation. Because the final fully connected layer of the speech recognition model outputs probability values for different pronunciations, each speech frame can correspond to multiple candidate pronunciations, each with its own probability value; the corresponding text information is then generated from the contextual relevance of the characters corresponding to each pronunciation and the probability value of each character. Collecting the candidate pronunciations of every speech frame, together with their probability values, yields the pronunciation probability matrix.
- Table 1 shows a schematic diagram of the pronunciation probability matrix provided by an embodiment of the present application.
- the extended speech signal includes four speech frames, T1 to T4, and each speech frame can be used to represent a character.
- After the first speech frame T1 is recognized by the asynchronous speech recognition model, it corresponds to four candidate pronunciations, namely "xiao", "xing", "liao", and "liang", with probability values of 61%, 15%, 21%, and 3%, respectively.
- Each subsequent speech frame likewise has multiple candidate pronunciations, each with its own pronunciation probability.
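- As a minimal sketch, one row of the pronunciation probability matrix can be held as a mapping from candidate pronunciation to probability. Only the T1 values are given in the text, so only that row appears here; the variable name `pronunciation_matrix` and the helper `best_candidate` are assumptions made for illustration.

```python
# One entry per speech frame; each entry maps candidate pronunciation -> probability.
# Only frame T1 is specified in the description, so later frames are left out.
pronunciation_matrix = {
    "T1": {"xiao": 0.61, "xing": 0.15, "liao": 0.21, "liang": 0.03},
}


def best_candidate(row):
    """Return the candidate pronunciation with the highest probability in one frame."""
    return max(row, key=row.get)
```

- Picking the single most probable candidate per frame is only the simplest reading; the description also uses contextual relevance between characters when producing the final text.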
- a second native speech model is trained according to the pronunciation probability matrix and the extended speech signal to obtain the real-time speech recognition model.
- the terminal device can train the second native speech model in conjunction with the asynchronous speech recognition model and the existing training samples to obtain the real-time speech recognition model, thereby improving the recognition accuracy of the real-time speech recognition model.
- Specifically, the asynchronous speech recognition model supervises, predicts, and corrects the training process of the second native speech model, improving the training efficiency and accuracy of that model so that the real-time speech recognition model is obtained.
- each input in the training set corresponds to only one standard output result.
- The output obtained by recognition, however, may contain multiple candidate pronunciations. If each input corresponds to only one standard output result and training follows that result alone, it cannot be determined whether the direction of speech prediction is accurate, which reduces the accuracy of training.
- This application therefore introduces an asynchronous speech recognition model to correct the speech prediction direction of the real-time speech recognition model.
- The real-time speech recognition model is trained based on the above-mentioned pronunciation probability matrix. Because the asynchronous speech recognition model has higher accuracy and reliability, this ensures that the speech prediction direction of the real-time speech recognition model is consistent with the recognition direction of the asynchronous speech recognition model, thereby improving the accuracy of the real-time speech recognition model.
- The process of training the second native voice model may specifically be: importing the extended voice signal into the second native voice model and generating the corresponding predicted pronunciation matrix.
- The terminal device compares the pronunciation probability matrix with the predicted pronunciation matrix: it determines which candidate pronunciations differ and the deviation between the probability values of the same candidate pronunciation, calculates the deviation rate between the two matrices, and determines the loss amount of the second native speech model from all the deviation rates.
- The second native speech model is then adjusted according to the loss amount.
- The loss amount can still be calculated with the CTC Loss function. The specific formula can be found in the above discussion and is not repeated here.
- In this case, z in the function is the aforementioned pronunciation probability matrix, and p(z|x) is the probability value of outputting that pronunciation probability matrix.
- In this embodiment, the asynchronous speech recognition model is trained first, and the training process of the real-time speech recognition model is then supervised by the asynchronous speech recognition model. This improves the training effect, corrects the speech recognition, and improves the accuracy of the real-time speech recognition model.
- FIG. 10 shows a specific implementation flowchart of a voice recognition method S803 provided by the fourth embodiment of the present application.
- S803 in a voice recognition method provided in this embodiment includes: S1001 to S1002, and the details are as follows:
- the training a second native speech model according to the pronunciation probability matrix and the extended speech signal to obtain the real-time speech recognition model includes
- the training process for the second native speech model is divided into two parts, one is a coarse-grained training process, and the other is a fine-grained training process.
- the coarse-grained training process specifically performs voice error correction and supervision through the pronunciation probability matrix generated by the asynchronous voice recognition model.
- The terminal device may use the extended voice signal as the training input of the second native voice model and the pronunciation probability matrix as its training output, and train the second native voice model until its result converges and the corresponding loss function is less than the preset loss threshold.
- At that point the training of the second native voice model is complete, and the trained second native voice model is taken as the quasi-real-time voice model for the subsequent fine-grained training operation.
- The process of performing coarse-grained training on the second native voice model may specifically be: dividing the extended voice signals into multiple training groups, each containing a certain number of extended voice signals and the pronunciation probability matrices associated with them.
- The terminal device trains the second native voice model with each training group in turn. After each round of training, it imports a preset original voice signal, used as the verification set, into the trained second native voice model and calculates the deviation rate of that verification.
- The terminal device takes the network parameters of the second native voice model at the point where the deviation rate is smallest as the trained network parameters, and imports them into the second native voice model to obtain the above-mentioned quasi-real-time voice model.
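- The group-wise training and parameter selection just described can be sketched as a loop that keeps whichever parameters gave the smallest deviation rate on the verification set. The function `coarse_train` and the callables `train_step` and `deviation_rate` are hypothetical placeholders; the application does not specify how either is computed.

```python
def coarse_train(params, groups, train_step, deviation_rate):
    """Train on each group in turn and return the best (params, deviation rate).

    train_step(params, group) -> new params after training on one group.
    deviation_rate(params)    -> deviation rate on the verification set.
    Both callables are assumptions supplied by the caller.
    """
    best_params, best_dev = None, None
    for group in groups:
        params = train_step(params, group)
        dev = deviation_rate(params)
        # Keep the parameters observed when the deviation rate is smallest.
        if best_dev is None or dev < best_dev:
            best_params, best_dev = params, dev
    return best_params, best_dev
```

- For example, with a toy "model" that is a single number, `coarse_train(0.0, [0.5, 0.4, 0.3], lambda p, g: p + g, lambda p: abs(p - 1.0))` keeps the parameters reached after the second group, because the third group overshoots the target.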
- fine-grained training is performed on the quasi-real-time speech model according to the original speech signal and the original language text to obtain the real-time speech recognition model.
- After the terminal device has generated the quasi-real-time speech model, it can perform a second round of training, namely the above-mentioned fine-grained training. The training data used in the fine-grained training are the original voice signal and the original language text corresponding to the original voice signal.
- The original voice signal is a voice signal that has not been converted.
- The pronunciation of each character in the original voice signal varies from user to user, so the original voice signal gives higher coverage of the test process; and because a user's pronunciation may deviate from the standard pronunciation, the subsequent training process can also identify and correct such deviations.
- The terminal device can use the original speech signal and its corresponding original language text as training samples to train the quasi-real-time speech model until the training result converges and the loss amount of the model is less than the preset loss threshold.
- The network parameters at that point are taken as the trained network parameters, and the quasi-real-time speech model is configured with them to obtain the above-mentioned real-time speech recognition model.
- The function used to calculate the loss amount of the above-mentioned quasi-real-time speech model may be the Connectionist Temporal Classification loss function (CTC Loss), which can be expressed as:
$$Loss_{ctc} = -\sum_{(x,z) \in S} \ln p(z \mid x)$$
- where $Loss_{ctc}$ is the aforementioned loss amount; $x$ is the original speech signal; $z$ is the original language text corresponding to the original speech signal; $S$ is the training set composed of all the original speech signals; and $p(z \mid x)$ is the probability value of outputting the original language text given the original speech signal.
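- To make the term p(z|x) concrete, the following sketch evaluates it with the standard CTC forward recursion and sums the negative logarithms over a training set. This is the textbook form of CTC, shown under the assumption that the application uses it unchanged; the function names, the integer symbol encoding, and the choice of index 0 for the blank are all illustrative.

```python
import math


def ctc_probability(probs, target, blank=0):
    """p(z|x): total probability of all frame alignments that collapse to `target`.

    probs[t][k] is the model's probability of symbol k at frame t (assumed input).
    target is a list of non-blank symbol indices.
    """
    # Interleave blanks: [a, b] -> [blank, a, blank, b, blank].
    ext = [blank]
    for symbol in target:
        ext += [symbol, blank]
    size = len(ext)

    alpha = [0.0] * size
    alpha[0] = probs[0][ext[0]]
    if size > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, len(probs)):
        new = [0.0] * size
        for s in range(size):
            total = alpha[s]
            if s >= 1:
                total += alpha[s - 1]
            # A blank may be skipped only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * probs[t][ext[s]]
        alpha = new

    return alpha[-1] + (alpha[-2] if size > 1 else 0.0)


def ctc_loss(training_set):
    """Sum of -ln p(z|x) over (probs, target) pairs in the training set."""
    return -sum(math.log(ctc_probability(p, z)) for p, z in training_set)
```

- With two frames, two symbols (blank and one label), and uniform frame probabilities of 0.5, three of the four possible alignments collapse to the single label, so p(z|x) is 0.75.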
- In this embodiment, the second native speech model is trained in two stages to generate the real-time speech recognition model: the training samples are expanded with the extended speech signals, and the asynchronous speech recognition model is used to correct the training process, which improves the accuracy of training.
- FIG. 11 shows a specific implementation flowchart of a voice recognition method S1001 provided by the fifth embodiment of the present application.
- S1001 in a voice recognition method provided in this embodiment includes: S1101 to S1103, which are detailed as follows:
- the coarse-grained training of the second native speech model based on the pronunciation probability matrix and the extended speech signal to obtain a quasi-real-time speech model includes:
- the extended voice signal is imported into the second native voice model, and the prediction probability matrix corresponding to the extended voice signal is determined.
- the terminal device can use the extended voice signal as the input for training and import it into the second native voice model.
- The second native voice model can determine the candidate pronunciations corresponding to each voice frame in the extended voice signal and the determination probability of each candidate pronunciation, and a prediction probability matrix is generated from the candidate pronunciations of all the voice frames and their associated determination probabilities.
- The structure of the prediction probability matrix is the same as that of the pronunciation probability matrix. For details, refer to the description of the foregoing embodiment, which is not repeated here.
- the pronunciation probability matrix and the prediction probability matrix are imported into a preset loss function, and the loss amount of the second native speech model is calculated.
- Each extended speech signal corresponds to two probability matrices: the prediction probability matrix output by the second native speech model and the pronunciation probability matrix output by the asynchronous speech recognition model.
- The terminal device can import the two probability matrices corresponding to every extended speech signal into the preset loss function and calculate the loss amount of the second native speech model. The more closely each candidate pronunciation and its probability value in the prediction probability matrix match the pronunciation probability matrix, the smaller the loss amount, so the recognition accuracy of the second native speech model can be determined from the loss amount.
- the loss function is specifically:
$$Loss_{top\_k} = -\sum_{t=1}^{T}\sum_{c=1}^{C} \hat{y}_c^t \ln p_c^t, \qquad \hat{y}_c^t = \begin{cases} y_c^t, & r_c^t \le K \\ 0, & r_c^t > K \end{cases}$$
- where $Loss_{top\_k}$ is the loss amount; $p_c^t$ is the probability value of the c-th pronunciation of the t-th frame of the extended speech signal in the prediction probability matrix; $\hat{y}_c^t$ is the probability value of the c-th pronunciation of the t-th frame of the extended speech signal in the pronunciation probability matrix after processing by the optimization algorithm; $T$ is the total number of frames; $C$ is the total number of pronunciations recognized in the t-th frame; $y_c^t$ is the probability value of the c-th pronunciation of the t-th frame of the extended speech signal in the pronunciation probability matrix; $r_c^t$ is the sequence number of the c-th pronunciation after all the pronunciations of the t-th frame in the pronunciation probability matrix are sorted by probability value from largest to smallest; and $K$ is a preset parameter.
- This loss function trains the second native speech model to learn only the first K pronunciations with the highest probability values in the asynchronous speech recognition model; pronunciations with lower probability values need not be learned. For the first K pronunciations, the probability value is left unchanged, that is, $\hat{y}_c^t = y_c^t$. For all other pronunciations, the optimized probability value is 0 and the corresponding learning efficiency is 0. In this way the speech recognition of the second native speech model is corrected, and the correction remains efficient because the model does not have to learn invalid, low-probability pronunciation predictions.
- Table 2 shows a pronunciation probability matrix processed by the optimization algorithm provided by this application.
- The pronunciation probability matrix before optimization is shown in Table 1.
- The pronunciation probability matrix in Table 1 is not sorted by probability value. If the value of K configured in the optimization algorithm is 2, the second native speech model learns the two pronunciations with the highest probability values in each frame. Consider the first pronunciation of the first frame, "xiao", whose pronunciation probability is 61%.
- Among all the pronunciations of the first frame, sorted from highest to lowest probability, "xiao" ranks first. Since 1 is less than or equal to K, this pronunciation probability is learned, so its optimized value equals the original value, 61%. Now consider the second pronunciation of the first frame, "xing", whose pronunciation probability is 15%. In the same ranking it is third. Since 3 is greater than K, this pronunciation probability is not learned, so its optimized value differs from the original value and is 0. Proceeding in this way for every entry yields the pronunciation probability matrix processed by the optimization algorithm.
- Determining the loss function with the Top-K method lets the model learn the higher-probability pronunciation predictions, which preserves training accuracy while speeding up convergence and improving the training effect. It also compresses the pronunciation probability matrix output by the asynchronous speech recognition model and saves storage space.
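- The Top-K processing and the resulting loss can be sketched as follows, using the first-frame values from Table 1. The cross-entropy form is an assumption based on the name "Top-K CE Loss" used in this application; the function names and the list-of-lists matrix layout are illustrative.

```python
import math


def top_k_filter(row, k):
    """Keep the k largest probabilities of one frame; set the others to 0."""
    order = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
    keep = set(order[:k])
    return [p if c in keep else 0.0 for c, p in enumerate(row)]


def top_k_ce_loss(predicted, teacher, k):
    """Cross-entropy between the filtered teacher matrix and the prediction.

    predicted[t][c]: prediction probability matrix (second native model).
    teacher[t][c]:   pronunciation probability matrix (asynchronous model).
    """
    loss = 0.0
    for p_row, y_row in zip(predicted, teacher):
        for p, y_hat in zip(p_row, top_k_filter(y_row, k)):
            if y_hat > 0.0:  # zeroed entries contribute nothing
                loss -= y_hat * math.log(p)
    return loss


# Frame T1 from Table 1: "xiao", "xing", "liao", "liang".
t1 = [0.61, 0.15, 0.21, 0.03]
filtered_t1 = top_k_filter(t1, 2)  # [0.61, 0.0, 0.21, 0.0]
```

- With K = 2, only "xiao" and "liao" keep their probabilities; "xing" ranks third and is zeroed, matching the worked example above.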
- network parameters in the second native speech model are adjusted based on the loss amount to obtain the quasi-real-time speech recognition model.
- the terminal device can adjust the second native speech model according to the aforementioned loss amount.
- When the aforementioned loss amount is less than the preset loss threshold and the result converges, the corresponding network parameters are taken as the trained network parameters.
- The second native speech model is configured with the trained network parameters to obtain the quasi-real-time speech recognition model.
- FIG. 12 shows a schematic diagram of a training process of a real-time speech model provided by an embodiment of the present application.
- the training process consists of three stages, namely a pre-training stage, a coarse-grained training stage, and a fine-grained training stage.
- The pre-training stage trains the asynchronous speech model based on the original speech signal and the original language text; the loss function used in this stage can be the CTC Loss function.
- In the coarse-grained training stage, the trained asynchronous speech model outputs the pronunciation probability matrix of each extended speech signal, and the quasi-real-time speech model is trained based on the pronunciation probability matrix and the extended speech signal; the loss function used in this stage can be the Top-K CE Loss function.
- The fine-grained training stage trains the real-time speech model based on the original speech signal and the original language text; the loss function used in this stage can be the CTC Loss function.
- In this embodiment, the deviation between the two probability matrices is calculated to determine the recognition loss between the second native speech model and the asynchronous speech recognition model, so that the second native speech model is corrected based on the asynchronous speech recognition model, which improves the accuracy of training.
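- The order of the three stages can be sketched as a small driver function. Every callable passed in (`fit_ctc`, `predict_matrix`, `fit_top_k`) is a hypothetical placeholder for a training or inference routine that the application does not spell out; the sketch only fixes which data and which loss each stage uses.

```python
def run_three_stage_training(first_native, second_native, original_pairs,
                             extended_signals, fit_ctc, predict_matrix, fit_top_k):
    """Run pre-training, coarse-grained training, and fine-grained training in order.

    fit_ctc(model, pairs)      -> model trained with CTC Loss on (signal, text) pairs.
    predict_matrix(model, x)   -> pronunciation probability matrix for signal x.
    fit_top_k(model, samples)  -> model trained with Top-K CE Loss on
                                  (signal, matrix) samples.
    All three callables are assumptions supplied by the caller.
    """
    # Stage 1: pre-train the asynchronous model on original speech and text.
    async_model = fit_ctc(first_native, original_pairs)

    # Stage 2: label extended signals with the asynchronous model, then
    # coarse-grained training of the second native model.
    samples = [(x, predict_matrix(async_model, x)) for x in extended_signals]
    quasi_real_time = fit_top_k(second_native, samples)

    # Stage 3: fine-grained training on original speech and text.
    return fit_ctc(quasi_real_time, original_pairs)
```

- Passing the routines in as arguments keeps the sketch independent of any particular framework; a real implementation would substitute its own model objects and training loops.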
- FIG. 13 shows a specific implementation flowchart of a voice recognition method S303 provided by the sixth embodiment of the present application.
- a voice recognition method S303 provided in this embodiment includes: S1301 to S1303, which are detailed as follows:
- the inputting the target language signal into a speech recognition model corresponding to the target language type to obtain text information output by the speech recognition model includes:
- the target voice signal is divided into multiple audio frames.
- The voice signal is composed of multiple audio frames. Each audio frame has a preset frame length, consecutive audio frames are separated by a fixed frame interval, and arranging the audio frames according to the frame interval yields the complete audio stream.
- the terminal device can divide the target voice signal according to the preset frame interval and frame length to obtain multiple audio frames.
- Each audio frame can correspond to the pronunciation of a character.
- the terminal device can realize the conversion from time domain to frequency domain through discrete Fourier transform, so as to obtain the voice frequency band corresponding to each audio frame.
- The pronunciation frequency of each pronunciation can be determined from the voice frequency band, and the character corresponding to each pronunciation can then be determined from its pronunciation frequency.
- the speech spectrum corresponding to each audio frame is sequentially imported into the real-time speech recognition model, and the text information is output.
- The terminal device can import the speech spectrum obtained by converting each audio frame into the real-time speech recognition model in sequence, following the frame number of each audio frame in the target voice signal. The real-time speech recognition model outputs the pronunciation probabilities corresponding to each audio frame and generates the corresponding text information based on the candidate pronunciation probabilities and their contextual relevance.
- In this embodiment, the target voice signal is preprocessed to obtain the voice spectrum of each audio frame, which improves the data processing efficiency of the real-time voice recognition model and therefore the recognition efficiency.
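- The framing and time-to-frequency conversion steps can be sketched with a direct discrete Fourier transform. The function names and the plain-list signal representation are assumptions; a production system would normally apply a window function and use a fast Fourier transform, neither of which the text specifies.

```python
import cmath


def split_frames(samples, frame_length, frame_interval):
    """Cut `samples` into frames of `frame_length`, one starting every `frame_interval`."""
    last_start = len(samples) - frame_length
    return [samples[start:start + frame_length]
            for start in range(0, last_start + 1, frame_interval)]


def dft_magnitudes(frame):
    """Magnitude of each discrete Fourier transform bin of one frame."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * i / n)
                for i, x in enumerate(frame)))
        for k in range(n)
    ]


def speech_spectra(samples, frame_length, frame_interval):
    """Spectrum of every frame, in frame order, ready to feed to a model."""
    return [dft_magnitudes(f) for f in split_frames(samples, frame_length, frame_interval)]
```

- Trailing samples that do not fill a whole frame are dropped in this sketch; padding the final frame would be an equally reasonable choice.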
- FIG. 14 shows a specific implementation flowchart of a voice recognition method provided by the seventh embodiment of the present application.
- the method for speech recognition provided in this embodiment further includes S1401, which is detailed as follows:
- the method further includes:
- After the terminal device outputs the text information corresponding to the target voice signal, it can import the target voice signal and the corresponding text information into the training set, thereby expanding the training set automatically.
- In this embodiment, the number of samples in the training set is increased in this way, which automatically expands the sample set and improves the accuracy of the training operation.
- FIG. 15 shows a structural block diagram of a voice recognition device provided in an embodiment of the present application. For ease of description, only the parts related to the embodiment of the present application are shown.
- the voice recognition device includes:
- the target voice signal acquiring unit 151 is configured to acquire the target voice signal to be recognized
- the target language type recognition unit 152 is configured to determine the target language type of the target speech signal
- the voice recognition unit 153 is configured to input the target voice signal into a voice recognition model corresponding to the target language type to obtain the text information output by the voice recognition model;
- the real-time voice recognition model is obtained by training on a training set that contains the original voice signal and the extended speech signal;
- the extended speech signal is obtained by converting existing text of the basic language type.
- the target voice signal is input to the speech recognition model corresponding to the target language type to obtain the text information output by the speech recognition model;
- the speech recognition model is obtained by training on a training sample set, and the training sample set includes a plurality of extended speech signals, the extended text information corresponding to each extended speech signal, the original speech signal corresponding to each extended speech signal, and the original text information corresponding to each original speech signal; the extended speech signal is obtained by converting existing text of the basic language type.
- the voice recognition device further includes:
- An existing text obtaining unit configured to obtain the existing text corresponding to the basic language type
- An extended speech-to-text conversion unit configured to convert the existing text into an extended speech text corresponding to the target language type
- the extended speech signal generating unit is configured to generate the extended speech signal corresponding to the extended speech text based on a speech synthesis algorithm.
- the voice recognition device further includes:
- An asynchronous speech recognition model configuration unit configured to train the first native speech model through the original speech signal in the training set and the original language text corresponding to the original speech signal to obtain an asynchronous speech recognition model
- a pronunciation probability matrix output unit configured to output the pronunciation probability matrix corresponding to the extended speech signal based on the asynchronous speech recognition model
- the real-time speech recognition model configuration unit is configured to train a second native speech model according to the pronunciation probability matrix and the extended speech signal to obtain the real-time speech recognition model.
- the real-time speech recognition model configuration unit includes:
- a quasi-real-time speech model generating unit configured to perform coarse-grained training on the second native speech model according to the pronunciation probability matrix and the extended speech signal to obtain a quasi-real-time speech model
- the real-time speech recognition model generating unit is configured to perform fine-grained training on the quasi-real-time speech model according to the original speech signal and the original language text to obtain the real-time speech recognition model.
- the quasi-real-time speech model generation unit includes:
- a prediction probability matrix generating unit configured to import the extended voice signal into the second native voice model, and determine the prediction probability matrix corresponding to the extended voice signal;
- a loss calculation unit configured to import the pronunciation probability matrix and the prediction probability matrix into a preset loss function and calculate the loss amount of the second native speech model
- the quasi-real-time speech recognition model training unit is configured to adjust network parameters in the second native speech model based on the loss amount to obtain the quasi-real-time speech recognition model.
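- the loss function is specifically:
$$Loss_{top\_k} = -\sum_{t=1}^{T}\sum_{c=1}^{C} \hat{y}_c^t \ln p_c^t, \qquad \hat{y}_c^t = \begin{cases} y_c^t, & r_c^t \le K \\ 0, & r_c^t > K \end{cases}$$
- where $Loss_{top\_k}$ is the loss amount; $p_c^t$ is the probability value of the c-th pronunciation of the t-th frame of the extended speech signal in the prediction probability matrix; $\hat{y}_c^t$ is the probability value of the c-th pronunciation of the t-th frame of the extended speech signal in the pronunciation probability matrix after processing by the optimization algorithm; $T$ is the total number of frames; $C$ is the total number of pronunciations recognized in the t-th frame; $y_c^t$ is the probability value of the c-th pronunciation of the t-th frame of the extended speech signal in the pronunciation probability matrix; $r_c^t$ is the sequence number of the c-th pronunciation after all the pronunciations of the t-th frame in the pronunciation probability matrix are sorted by probability value from largest to smallest; and $K$ is a preset parameter.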
- the number of first network layers in the asynchronous speech recognition model is greater than the number of second network layers in the real-time speech recognition model.
- the voice recognition unit 153 includes:
- the speech spectrum corresponding to each of the audio frames is sequentially imported into the real-time speech recognition model, and the text information is output.
- the voice recognition device further includes:
- the training set expansion unit is used to import the target speech signal into the training set corresponding to the target language type
- The speech recognition device provided by the embodiment of the present application can likewise convert basic-language text, for which a large number of samples exist, into extended speech signals, and train the real-time speech recognition model corresponding to the target language type with both the original speech signals of the target language type and the extended speech signals.
- The target speech signal is then recognized by the trained real-time speech recognition model, and the text information is output. This increases the number of samples available for training real-time speech recognition models of non-basic languages, thereby improving the accuracy and applicability of speech recognition.
- FIG. 16 is a schematic structural diagram of a terminal device provided by an embodiment of this application.
- the terminal device 16 of this embodiment includes: at least one processor 160 (only one is shown in FIG. 16), a memory 161, and a computer program stored in the memory 161 and executable on the at least one processor 160.
- the terminal device 16 may be a computing device such as a desktop computer, a notebook, a palmtop computer, and a cloud server.
- the terminal device may include, but is not limited to, a processor 160 and a memory 161.
- FIG. 16 is only an example of the terminal device 16 and does not constitute a limitation on the terminal device 16. The terminal device may include more or fewer components than shown in the figure, a combination of certain components, or different components; for example, it can also include input and output devices, network access devices, and so on.
- The processor 160 may be a central processing unit (Central Processing Unit, CPU), or it may be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
- The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
- In some embodiments, the memory 161 may be an internal storage unit of the terminal device 16, such as a hard disk or memory of the terminal device 16. In other embodiments, the memory 161 may be an external storage device of the terminal device 16, for example, a plug-in hard disk, a smart memory card (Smart Media Card, SMC), a Secure Digital (SD) card, or a flash card fitted to the terminal device 16. Further, the memory 161 may include both an internal storage unit of the terminal device 16 and an external storage device.
- the memory 161 is used to store an operating system, an application program, a boot loader (BootLoader), data, and other programs, such as the program code of the computer program.
- the memory 161 can also be used to temporarily store data that has been output or will be output.
- An embodiment of the present application also provides a network device, which includes: at least one processor, a memory, and a computer program stored in the memory and running on the at least one processor, and the processor executes The computer program implements the steps in any of the foregoing method embodiments.
- the embodiments of the present application also provide a computer-readable storage medium, where the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the steps in each of the foregoing method embodiments can be realized.
- The embodiments of the present application provide a computer program product; when the computer program product runs on a mobile terminal, the steps in the foregoing method embodiments can be realized by the mobile terminal.
- If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
- The computer program can be stored in a computer-readable storage medium, and when the computer program is executed by a processor, the steps of the foregoing method embodiments can be implemented.
- the computer program includes computer program code, and the computer program code may be in the form of source code, object code, executable file, or some intermediate forms.
- the computer-readable medium may at least include: any entity or device capable of carrying the computer program code to the photographing device/terminal device, recording medium, computer memory, read-only memory (ROM, Read-Only Memory), and random access memory (RAM, Random Access Memory), electric carrier signal, telecommunications signal and software distribution medium.
- U disk, mobile hard disk, floppy disk, or CD-ROM, etc.
- computer-readable media cannot be electrical carrier signals and telecommunication signals.
- the disclosed apparatus/network equipment and method may be implemented in other ways.
- the device/network device embodiments described above are only illustrative.
- the division of the modules or units is only a logical function division, and there may be other divisions in actual implementation, such as multiple units.
- components can be combined or integrated into another system, or some features can be omitted or not implemented.
- the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
- the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Telephonic Communication Services (AREA)
- Machine Translation (AREA)
Abstract
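A method, apparatus, terminal, and storage medium for speech recognition based on artificial intelligence. The method includes: acquiring a target speech signal to be recognized (S301); determining the target language type of the target speech signal (S302); and outputting the text information of the target speech signal through a real-time speech recognition model corresponding to the target language type (S303). The real-time speech recognition model is trained on a training set containing original speech signals and extended speech signals; the extended speech signals are obtained by converting existing text of a base language type. The method increases the number of samples available for training real-time speech recognition models for non-base languages, which improves the accuracy and applicability of speech recognition.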
Description
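This application claims priority to Chinese Patent Application No. 201911409041.5, entitled "Method, apparatus, terminal, and storage medium for speech recognition", filed with the China National Intellectual Property Administration on December 31, 2019, the entire contents of which are incorporated herein by reference.
This application belongs to the field of data processing technology, and in particular relates to a method, apparatus, terminal, and storage medium for speech recognition.
With the development of terminal device technology, speech recognition, as an important mode of human-computer interaction, is applied in many different fields, and improving its accuracy and scope of application has become increasingly important. Existing speech recognition technology achieves high accuracy for base language types because many samples are available, but low accuracy for non-base language types, such as dialects and minority languages, because few samples are available. Existing speech recognition technology therefore has low recognition accuracy for non-base languages, which limits its applicability.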
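Summary
Embodiments of this application provide a method, apparatus, terminal, and storage medium for speech recognition, which can solve the problems of low recognition accuracy and poor applicability of existing speech recognition technology for non-base languages.
In a first aspect, an embodiment of this application provides a speech recognition method, including:
acquiring a target speech signal to be recognized;
determining the target language type of the target speech signal;
inputting the target speech signal into a speech recognition model corresponding to the target language type, and obtaining the text information output by the speech recognition model;
wherein the speech recognition model is obtained by training on a training sample set; the training sample set includes a plurality of extended speech signals, the extended text information corresponding to each extended speech signal, the original speech signal corresponding to each extended speech signal, and the original text information corresponding to each original speech signal; and the extended speech signals are obtained by converting existing text of a base language type.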
在第一方面的一种可能的实现方式中,在所述将所述目标语言信号输入至与所述目标语言类型对应的语音识别模型,获得所述语音识别模型输出的文本信息之前,还包括:
获取所述基础语言类型对应的已有文本;
将所述已有文本转换成所述目标语言类型对应的扩展语音文本;
生成所述扩展语音文本对应的所述扩展语音信号。
在第一方面的一种可能的实现方式中,在所述将所述目标语言信号输入至与所述目标语言类型对应的语音识别模型,获得所述语音识别模型输出的文本信息之前,还包括:
通过所述训练集中的所述原始语音信号以及与所述原始语音信号对应的原始语言 文本,对第一原生语音模型进行训练,得到异步语音识别模型;
基于所述异步语音识别模型,输出所述扩展语音信号对应的发音概率矩阵;
根据所述发音概率矩阵以及所述扩展语音信号,对第二原生语音模型进行训练,得到所述实时语音识别模型。
在第一方面的一种可能的实现方式中,所述根据所述发音概率矩阵以及所述扩展语音信号,对第二原生语音模型进行训练,得到所述实时语音识别模型,包括:
根据发音概率矩阵以及所述扩展语音信号,对所述第二原生语音模型进行粗粒度训练,得到准实时语音模型;
根据所述原始语音信号以及所述原始语言文本,对所述准实时语音模型进行细粒度训练,得到所述实时语音识别模型。
在第一方面的一种可能的实现方式中,所述根据发音概率矩阵以及所述扩展语音文本,对所述第二原生语音模型进行粗粒度训练,得到准实时语音模型,包括:
将所述扩展语音信号导入所述第二原生语音模型,确定所述扩展语音信号对应的预测概率矩阵;
所述发音概率矩阵以及所述预测概率矩阵导入预设的损失函数,计算所述第二原生语音模型的损失量;
基于所述损失量调整所述第二原生语音模型内的网络参量,得到所述准实时语音识别模型。
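In a possible implementation of the first aspect, the loss function is specifically:
$$\mathrm{Loss}_{top\_k} = -\sum_{t=1}^{T}\sum_{c=1}^{C} \hat{y}_c^t \,\ln p_c^t, \qquad \hat{y}_c^t = \begin{cases} y_c^t, & r_c^t \le K \\ 0, & r_c^t > K \end{cases}$$
where $\mathrm{Loss}_{top\_k}$ is the loss; $p_c^t$ is the probability value of the c-th pronunciation in the t-th frame of the extended speech signal in the predicted probability matrix; $\hat{y}_c^t$ is the probability value of the c-th pronunciation in the t-th frame of the extended speech signal in the pronunciation probability matrix after processing by the optimization algorithm; T is the total number of frames; C is the total number of pronunciations recognized in the t-th frame; $y_c^t$ is the probability value of the c-th pronunciation in the t-th frame of the extended speech signal in the pronunciation probability matrix; $r_c^t$ is the rank of the c-th pronunciation after all pronunciations of the t-th frame of the extended speech signal in the pronunciation probability matrix are sorted by probability value in descending order; and K is a preset parameter.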
在第一方面的一种可能的实现方式中,所述异步语音识别模型内的第一网络层级多于所述实时语音识别模型内的第二网络层级。
在第一方面的一种可能的实现方式中,所述将所述目标语言信号输入至与所述目标语言类型对应的语音识别模型,获得所述语音识别模型输出的文本信息,包括:
将所述目标语音信号划分为多个音频帧;
分别对各个所述音频帧进行离散傅里叶变换,得到各个所述音频帧对应的语音频谱;
基于帧编号,依次将各个所述音频帧对应的所述语音频谱导入所述实时语音识别模型,输出所述文本信息。
在第一方面的一种可能的实现方式中,在所述将所述目标语言信号输入至与所述目标语言类型对应的语音识别模型,获得所述语音识别模型输出的文本信息之后,还包括:
将所述目标语音信号导入所述目标语言类型对应的训练集。
第二方面,本申请实施例提供了一种语音识别的装置,包括:
目标语音信号获取单元,用于获取待识别的目标语音信号;
目标语言类型识别单元,用于确定所述目标语音信号的目标语言类型;
语音识别单元,用于将所述目标语言信号输入至与所述目标语言类型对应的语音识别模型,获得所述语音识别模型输出的文本信息;
其中,所述语音识别模型是通过训练样本集训练得到的,所述训练样本集包括多个扩展语音信号、每个扩展语音信号对应的扩展文本信息、每个扩展语音信号对应的原始语音信号以及每个原始语音信号对应的原始文本信息,所述扩展语音信号是基于基础语言类型的已有文本转换得到的。
第三方面,本申请实施例提供了一种终端设备,存储器、处理器以及存储在所述存储器中并可在所述处理器上运行的计算机程序,其特征在于,所述处理器执行所述计算机程序时实现上述第一方面中任一项所述语音识别的方法。
第四方面,本申请实施例提供了一种计算机可读存储介质,所述计算机可读存储介质存储有计算机程序,其特征在于,所述计算机程序被处理器执行时实现上述第一方面中任一项所述语音识别的方法。
第五方面,本申请实施例提供了一种计算机程序产品,当计算机程序产品在终端设备上运行时,使得终端设备执行上述第一方面中任一项所述语音识别的方法。
可以理解的是,上述第二方面至第五方面的有益效果可以参见上述第一方面中的相关描述,在此不再赘述。
本申请实施例与现有技术相比存在的有益效果是:
本申请实施例通过样本数量较大的基础语言文本转换为扩展语音信号,并通过目标语言类型对应的原始语音信号以及扩展语音信号对目标语言类型对应的实时语音识别模型进行训练,并通过训练后的实时语音识别模型对目标语音信号进行语音识别,输出文本信息,从而能够增加训练非基础语言的实时语音识别模型训练所需的样本个数,从而提高了语音识别的准确性以及适用性。
图1是本申请实施例提供的手机的部分结构的框图;
图2是本申请实施例的手机的软件结构示意图;
图3是本申请第一实施例提供的一种语音识别的方法的实现流程图;
图4是本申请一实施例提供的语音识别系统的结构示意图;
图5是本申请一实施例提供的语音识别系统的交互流程图;
图6是本申请第二实施例提供的一种语音识别的方法具体实现流程图;
图7是本申请一实施例提供的扩展语音文本的转换示意图;
图8是本申请第三实施例提供的一种语音识别的方法具体实现流程图;
图9是本申请实施例提供的一种异步语音识别模型以及实时语音识别模型的结构 示意图;
图10是本申请第四实施例提供的一种语音识别的方法S803具体实现流程图;
图11是本申请第五实施例提供的一种语音识别的方法S1001具体实现流程图;
图12是本申请一实施例提供的实时语音模型的训练过程的示意图;
图13是本申请第六实施例提供的一种语音识别的方法S303的具体实现流程图;
图14是本申请第七实施例提供的一种语音识别的方法具体实现流程图;
图15是本申请一实施例提供的一种语音识别的设备的结构框图;
图16是本申请另一实施例提供的一种终端设备的示意图。
以下描述中,为了说明而不是为了限定,提出了诸如特定系统结构、技术之类的具体细节,以便透彻理解本申请实施例。然而,本领域的技术人员应当清楚,在没有这些具体细节的其它实施例中也可以实现本申请。在其它情况中,省略对众所周知的系统、装置、电路以及方法的详细说明,以免不必要的细节妨碍本申请的描述。
应当理解,当在本申请说明书和所附权利要求书中使用时,术语“包括”指示所描述特征、整体、步骤、操作、元素和/或组件的存在,但并不排除一个或多个其它特征、整体、步骤、操作、元素、组件和/或其集合的存在或添加。
还应当理解,在本申请说明书和所附权利要求书中使用的术语“和/或”是指相关联列出的项中的一个或多个的任何组合以及所有可能组合,并且包括这些组合。
如在本申请说明书和所附权利要求书中所使用的那样,术语“如果”可以依据上下文被解释为“当...时”或“一旦”或“响应于确定”或“响应于检测到”。类似地,短语“如果确定”或“如果检测到[所描述条件或事件]”可以依据上下文被解释为意指“一旦确定”或“响应于确定”或“一旦检测到[所描述条件或事件]”或“响应于检测到[所描述条件或事件]”。
另外,在本申请说明书和所附权利要求书的描述中,术语“第一”、“第二”、“第三”等仅用于区分描述,而不能理解为指示或暗示相对重要性。
在本申请说明书中描述的参考“一个实施例”或“一些实施例”等意味着在本申请的一个或多个实施例中包括结合该实施例描述的特定特征、结构或特点。由此,在本说明书中的不同之处出现的语句“在一个实施例中”、“在一些实施例中”、“在其他一些实施例中”、“在另外一些实施例中”等不是必然都参考相同的实施例,而是意味着“一个或多个但不是所有的实施例”,除非是以其他方式另外特别强调。术语“包括”、“包含”、“具有”及它们的变形都意味着“包括但不限于”,除非是以其他方式另外特别强调。
本申请实施例提供的语音识别的方法可以应用于手机、平板电脑、可穿戴设备、车载设备、增强现实(augmented reality,AR)/虚拟现实(virtual reality,VR)设备、笔记本电脑、超级移动个人计算机(ultra-mobile personal computer,UMPC)、上网本、个人数字助理(personal digital assistant,PDA)等终端设备上,还可以应用于数据库、服务器以及基于终端人工智能的服务响应系统,用于响应语音识别请求,本申请实施例对终端设备的具体类型不作任何限制。
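For example, the terminal device may be a station (STATION, ST) in a WLAN, a cellular phone, a cordless phone, a Session Initiation Protocol (SIP) phone, a Wireless Local Loop (WLL) station, a Personal Digital Assistant (PDA) device, a handheld device with wireless communication capability, a computing device or other processing device connected to a wireless modem, a computer, a laptop computer, a handheld communication device, a handheld computing device, and/or another device for communicating on a wireless system or a next-generation communication system, for example a mobile terminal in a 5G network or a mobile terminal in a future evolved Public Land Mobile Network (PLMN).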
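By way of example and not limitation, when the terminal device is a wearable device, the wearable device may also be a general term for devices developed by applying wearable technology to the intelligent design of everyday wear, such as glasses, gloves, watches, clothing, and shoes. A wearable device is a portable device that is worn directly on the body or integrated into the user's clothes or accessories, and that collects the user's speech signal while attached to the user. A wearable device is not only a hardware device; it also delivers powerful functions through software support, data interaction, and cloud interaction. In a broad sense, wearable smart devices include full-featured, large-sized devices that can realize complete or partial functions without relying on a smartphone, such as smart watches or smart glasses, as well as devices that focus on a single type of application function and must be used together with another device such as a smartphone, for example the various smart bracelets and smart jewelry used for monitoring physical signs.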
以所述终端设备为手机为例。图1示出的是与本申请实施例提供的手机的部分结构的框图。参考图1,手机包括:射频(Radio Frequency,RF)电路110、存储器120、输入单元130、显示单元140、传感器150、音频电路160、近场通信模块170、处理器180、以及电源190等部件。本领域技术人员可以理解,图1中示出的手机结构并不构成对手机的限定,可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件布置。
下面结合图1对手机的各个构成部件进行具体的介绍:
RF电路110可用于收发信息或通话过程中,信号的接收和发送,特别地,将基站的下行信息接收后,给处理器180处理;另外,将设计上行的数据发送给基站。通常,RF电路包括但不限于天线、至少一个放大器、收发信机、耦合器、低噪声放大器(Low Noise Amplifier,LNA)、双工器等。此外,RF电路110还可以通过无线通信与网络和其他设备通信。上述无线通信可以使用任一通信标准或协议,包括但不限于全球移动通讯系统(Global System of Mobile communication,GSM)、通用分组无线服务(General Packet Radio Service,GPRS)、码分多址(Code Division Multiple Access,CDMA)、宽带码分多址(Wideband Code Division Multiple Access,WCDMA)、长期演进(Long Term Evolution,LTE))、电子邮件、短消息服务(Short Messaging Service,SMS)等,通过RF电路110接收其他终端采集的语音信号,并对语音信号进行识别,输出对应的文本信息。
存储器120可用于存储软件程序以及模块,处理器180通过运行存储在存储器120的软件程序以及模块,从而执行手机的各种功能应用以及数据处理,例如将训练好的实时语音识别算法存储于存储器120内。存储器120可主要包括存储程序区和存储数据区,其中,存储程序区可存储操作系统、至少一个功能所需的应用程序(比如声音播放功能、图像播放功能等)等;存储数据区可存储根据手机的使用所创建的数据(比如音频数据、电话本等)等。此外,存储器120可以包括高速随机存取存储器,还可以包括非易失性存储器,例如至少一个磁盘存储器件、闪存器件、或其他易失性固态 存储器件。
输入单元130可用于接收输入的数字或字符信息,以及产生与手机100的用户设置以及功能控制有关的键信号输入。具体地,输入单元130可包括触控面板131以及其他输入设备132。触控面板131,也称为触摸屏,可收集用户在其上或附近的触摸操作(比如用户使用手指、触笔等任何适合的物体或附件在触控面板131上或在触控面板131附近的操作),并根据预先设定的程式驱动相应的连接装置。
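The display unit 140 may be used to display information entered by the user or provided to the user and the various menus of the mobile phone, for example to output the text information obtained by speech recognition. The display unit 140 may include a display panel 141, which may optionally be configured as a liquid crystal display (LCD), an organic light-emitting diode (OLED) display, or the like. Further, the touch panel 131 may cover the display panel 141; when the touch panel 131 detects a touch operation on or near it, it passes the operation to the processor 180 to determine the type of touch event, and the processor 180 then provides the corresponding visual output on the display panel 141 according to the type of touch event. Although in FIG. 1 the touch panel 131 and the display panel 141 are two independent components that implement the input and output functions of the mobile phone, in some embodiments the touch panel 131 and the display panel 141 may be integrated to implement the input and output functions of the mobile phone.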
手机100还可包括至少一种传感器150,比如光传感器、运动传感器以及其他传感器。具体地,光传感器可包括环境光传感器及接近传感器,其中,环境光传感器可根据环境光线的明暗来调节显示面板141的亮度,接近传感器可在手机移动到耳边时,关闭显示面板141和/或背光。作为运动传感器的一种,加速计传感器可检测各个方向上(一般为三轴)加速度的大小,静止时可检测出重力的大小及方向,可用于识别手机姿态的应用(比如横竖屏切换、相关游戏、磁力计姿态校准)、振动识别相关功能(比如计步器、敲击)等;至于手机还可配置的陀螺仪、气压计、湿度计、温度计、红外线传感器等其他传感器,在此不再赘述。
音频电路160、扬声器161,传声器162可提供用户与手机之间的音频接口。音频电路160可将接收到的音频数据转换后的电信号,传输到扬声器161,由扬声器161转换为声音信号输出;另一方面,传声器162将收集的声音信号转换为电信号,由音频电路160接收后转换为音频数据,再将音频数据输出处理器180处理后,经RF电路110以发送给比如另一手机,或者将音频数据输出至存储器120以便进一步处理。例如,终端设备可以通过传声器162,采集用户的目标语音信号,并将转换后的电信号发送给终端设备的处理器进行语音识别。
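The terminal device can receive, through the near-field communication module 170, a target speech signal sent by another device. For example, the near-field communication module 170 integrates a Bluetooth communication module, establishes a communication connection with a wearable device through the Bluetooth communication module, and receives the target speech signal fed back by the wearable device. Although FIG. 1 shows the near-field communication module 170, it can be understood that it is not an essential component of the mobile phone 100 and can be omitted as needed without changing the essence of the application.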
处理器180是手机的控制中心,利用各种接口和线路连接整个手机的各个部分,通过运行或执行存储在存储器120内的软件程序和/或模块,以及调用存储在存储器120内的数据,执行手机的各种功能和处理数据,从而对手机进行整体监控。可选的,处理器180可包括一个或多个处理单元;优选的,处理器180可集成应用处理器和调 制解调处理器,其中,应用处理器主要处理操作系统、用户界面和应用程序等,调制解调处理器主要处理无线通信。可以理解的是,上述调制解调处理器也可以不集成到处理器180中。
手机100还包括给各个部件供电的电源190(比如电池),优选的,电源可以通过电源管理系统与处理器180逻辑相连,从而通过电源管理系统实现管理充电、放电、以及功耗管理等功能。
图2是本申请实施例的手机100的软件结构示意图。以手机100操作系统为Android系统为例,在一些实施例中,将Android系统分为四层,分别为应用程序层、应用程序框架层(framework,FWK)、系统层以及硬件抽象层,层与层之间通过软件接口通信。
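As shown in FIG. 2, the application layer may include a series of application packages, which may include applications such as messaging, calendar, camera, video, navigation, gallery, and phone. In particular, the speech recognition algorithm may be embedded in an application; the speech recognition process is started through a related control in the application, and the collected target speech signal is processed to obtain the corresponding text information.
The application framework layer provides an application programming interface (API) and a programming framework for the applications of the application layer. The application framework layer may include some predefined functions, such as a function for receiving events sent by the application framework layer.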
如图2所示,应用程序框架层可以包括窗口管理器、资源管理器以及通知管理器等。
窗口管理器用于管理窗口程序。窗口管理器可以获取显示屏大小,判断是否有状态栏,锁定屏幕,截取屏幕等。内容提供器用来存放和获取数据,并使这些数据可以被应用程序访问。所述数据可以包括视频,图像,音频,拨打和接听的电话,浏览历史和书签,电话簿等。
资源管理器为应用程序提供各种资源,比如本地化字符串,图标,图片,布局文件,视频文件等等。
通知管理器使应用程序可以在状态栏中显示通知信息,可以用于传达告知类型的消息,可以短暂停留后自动消失,无需用户交互。比如通知管理器被用于告知下载完成,消息提醒等。通知管理器还可以是以图表或者滚动条文本形式出现在系统顶部状态栏的通知,例如后台运行的应用程序的通知,还可以是以对话窗口形式出现在屏幕上的通知。例如在状态栏提示文本信息,发出提示音,电子设备振动,指示灯闪烁等。
应用程序框架层还可以包括:
视图系统,所述视图系统包括可视控件,例如显示文字的控件,显示图片的控件等。视图系统可用于构建应用程序。显示界面可以由一个或多个视图组成的。例如,包括短信通知图标的显示界面,可以包括显示文字的视图以及显示图片的视图。
电话管理器用于提供手机100的通信功能。例如通话状态的管理(包括接通,挂断等)。
系统层可以包括多个功能模块。例如:传感器服务模块,物理状态识别模块,三维图形处理库(例如:OpenGL ES)等。
传感器服务模块,用于对硬件层各类传感器上传的传感器数据进行监测,确定手 机100的物理状态;
物理状态识别模块,用于对用户手势、人脸等进行分析和识别;
三维图形处理库用于实现三维图形绘图,图像渲染,合成,和图层处理等。
系统层还可以包括:
表面管理器用于对显示子系统进行管理,并且为多个应用程序提供了2D和3D图层的融合。
媒体库支持多种常用的音频,视频格式回放和录制,以及静态图像文件等。媒体库可以支持多种音视频编码格式,例如:MPEG4,H.264,MP3,AAC,AMR,JPG,PNG等。
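The hardware abstraction layer is the layer between hardware and software. It may include a display driver, a camera driver, a sensor driver, a microphone driver, and so on, which drive the related hardware of the hardware layer, such as the display, camera, sensors, and microphone. In particular, the microphone module is started through the microphone driver to collect the user's target speech information so that the subsequent speech recognition process can be executed.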
需要说明的是,本申请实施例提供的语音识别的方法可以在上述任一层级中执行,在此不做限定。
在本申请实施例中,流程的执行主体为安装有语音识别的程序的设备。作为示例而非限定,语音识别的程序的设备具体可以为终端设备,该终端设备可以为用户使用的智能手机、平板电脑、笔记本电脑、服务器等,对获取得到的语音信号进行识别,并确定该语音信号对应的文本信息,实现将声音信号转换为文本信息的目的。图3示出了本申请第一实施例提供的语音识别的方法的实现流程图,详述如下:
在S301中,获取待识别的目标语音信号。
在本实施例中,终端设备可以通过内置的麦克风模块采集用户的目标语音信号,在该情况下,用户可以通过启动终端设备内的特定应用以激活麦克风模块,例如录音应用、实时通话语音通话应用等;用户还可以通过点击当前应用中的部分控件,以激活麦克风模块,例如在社交应用中点击发送语音的控件,将采集到的语音信号作为交互信息发送给通信对端,此时终端设备会通过麦克风模块采集用户在点击操作过程中产生的语音信号,作为上述的目标语音信号;终端设备内置有输入法应用,该输入法应用支持语音输入功能,用户可以通过点击输入控件以激活终端设备内的输入法应用,并选择语音输入文字功能,此时终端设备可以启动麦克风模块,通过麦克风模块采集用户的目标语音信号,并将目标语音信号转换为文本信息,将该文本信息作为所需输入的参量导入到输入控件。终端设备还可以通过外置的麦克风模块采集用户的目标语音信号,在该情况下,终端设备可以通过无线通信模块或串行接口等方式与外置的麦克风模块建立通信连接,用户可以通过点击麦克风模块上的录音按键,启动麦克风模块采集目标语音信号,并将采集到的目标语音信号通过上述建立的通信连接传输给终端设备,终端设备接收到麦克风模块反馈的目标语音信号后,可以执行后续的语音识别流程。
在一种可能的实现方式中,终端设备除了可以通过麦克风模块获取待识别的目标语音信号外,还可以通过通信对端发送的方式进行获取。终端设备可以通过通信模块 与通信对端建立通信连接,通过通信连接接收通信对端发送的目标语音信号,其中,通信对端采集目标语音信号的方式可以参见上述过程,在此不再赘述。终端设备在接收到通信对端反馈的目标语音信号后,可以对该目标语音信号进行语音识别。以下以一应用场景对上述过程进行解释说明:终端设备A与终端设备B之间基于社交应用程序建立传输交互数据的通信链路,终端设备B通过内置的麦克风模块采集一目标语音信号,并将该目标语音信号通过上述建立的用于传输交互数据的通信链路发送给终端设备A。终端设备A可以通过扬声器模块播放上述目标语音信号,终端设备A的用户可以通过收听的方式获取到交互内容;若终端设备A的用户无法收听目标语音信号,则可以通过点击“文字转换”按钮,识别目标语音信号对应的文本信息,通过输出文本信息的方式显示交互内容。
在一种可能的实现方式中,终端设备在获取得到目标语音信号后,可以通过预设的信号优化算法对目标语音信号进行预处理,从而能够提高后续语音识别的准确性。其中,优化的方式包括但不限于以下一种或多种的组合:信号放大、信号滤波、异常检测、信号修复等。
其中,异常检测具体为根据采集得到的目标语音信号的信号波形,提取多个波形特征参数,例如信噪比、有效语音的持续占比、有效语音的持续时长等,并根据上述采集得到波形特征值计算目标语音信号的信号质量,若检测到该信号质量低于有效信号阈值,则识别目标语音信号为无效信号,不对无效信号执行后续语音识别操作。反之,若该信号质量高于有效信号阈值,则识别目标语音信号为有效信号,执行S302以及S303的操作。
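Signal repair specifically means performing waveform fitting, through a preset waveform fitting algorithm, on the regions interrupted during collection of the target speech signal, to generate a continuous target speech signal. The waveform fitting algorithm may be a neural network whose parameters are adjusted using historical speech signals collected from the target user, so that the waveform trend of the fitted target speech signal matches that of the target user, which improves the fitting result. Preferably, signal repair is performed after the anomaly detection described above. Repairing the missing waveform of the target speech signal raises its apparent collection quality, which would interfere with anomaly detection and prevent abnormal signals of poor collection quality from being identified. For this reason, the terminal device may first use the anomaly detection algorithm to determine whether the target speech signal is a valid signal. If it is valid, the signal repair algorithm is applied to the target speech signal; if it is abnormal, no signal repair is needed, which avoids unnecessary repair operations.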
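In a possible implementation, the terminal device may use a voice activity detection algorithm to extract the valid speech segments from the target speech signal, where a valid speech segment is a segment that contains spoken content and an invalid speech segment is one that does not. The terminal device may set a speech start amplitude and a speech end amplitude, where the speech start amplitude is greater than the speech end amplitude; that is, the requirement for starting a valid speech segment is stricter than the requirement for ending one. Users often speak with higher volume and pitch at the beginning of an utterance, so the speech amplitude is high at that point; during the utterance, some characters are pronounced weakly or softly, and this should not be recognized as the end of speech. The speech end amplitude is therefore lowered appropriately to avoid misrecognition. Based on the speech start amplitude and the speech end amplitude, the terminal device performs valid-speech recognition on the speech waveform and divides it into a plurality of valid speech segments, where the amplitude at the start time of each valid speech segment is greater than or equal to the speech start amplitude and the amplitude at its end time is less than or equal to the speech end amplitude. During subsequent recognition, the terminal device performs speech recognition only on the valid speech segments and skips the invalid ones, which shortens the signal to be recognized and improves recognition efficiency.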
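The two-threshold rule described above can be sketched as a small state machine. This is a minimal illustration, not the implementation used in the application: the function name `find_speech_segments`, the per-sample amplitude list, and the threshold values in the usage note are all assumptions made for the example.

```python
from typing import List, Optional, Tuple


def find_speech_segments(amplitudes: List[float],
                         start_threshold: float,
                         end_threshold: float) -> List[Tuple[int, int]]:
    """Return (start_index, end_index) pairs of valid speech segments.

    A segment opens when the amplitude reaches start_threshold and closes
    when it falls to end_threshold or below (hypothetical helper).
    """
    if start_threshold <= end_threshold:
        raise ValueError("start_threshold must be greater than end_threshold")
    segments: List[Tuple[int, int]] = []
    start: Optional[int] = None  # index where the open segment began
    for i, amp in enumerate(amplitudes):
        if start is None:
            if amp >= start_threshold:
                start = i  # loud enough to count as the start of speech
        elif amp <= end_threshold:
            segments.append((start, i))  # quiet enough to end the segment
            start = None
    if start is not None:
        # The signal ended while a segment was still open.
        segments.append((start, len(amplitudes) - 1))
    return segments
```

With amplitudes `[0.1, 0.9, 0.5, 0.4, 0.1, 0.05, 0.8, 0.6, 0.2]`, a start threshold of 0.7, and an end threshold of 0.2, the soft samples 0.5 and 0.4 stay inside the first segment because they are still above the lower end threshold.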
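In a possible implementation, the target speech signal may be an audio stream containing a plurality of speech frames. The sampling rate of the audio stream is 16 kHz, that is, 16k speech signal points are collected per second, and each signal point is represented by 16 bits, that is, the bit depth is 16 bits. Each speech frame has a frame length of 25 ms, and the interval between speech frames is 10 ms.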
在S302中,确定所述目标语音信号的目标语言类型。
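In this embodiment, after acquiring the target speech signal, the terminal device may determine the target language type of the target speech signal through a preset language recognition algorithm. Because the target speech signal may be speech in any of several language types, and different language types use different speech recognition algorithms, the target language type of the target speech signal must be determined before speech recognition is performed. Target language types may be divided by language, for example Chinese, English, Russian, German, French, and Japanese, and may also be divided by regional dialect: Chinese may be divided into Mandarin, Cantonese, Shanghainese, Sichuanese, and so on, and Japanese into the Kansai dialect and standard Japanese.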
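In a possible implementation, the terminal device may receive a geographic range entered by the user, such as Asia, China, or Guangdong. Based on this range, the terminal device determines the language types found in that region and adjusts the language recognition algorithm according to all language types in the range. By way of example and not limitation, if the geographic range is Guangdong and the language types in Guangdong are Cantonese, the Chaoshan dialect, Hakka, and Mandarin, the language recognition algorithm is configured for these four language types. The terminal device may also obtain, through a built-in positioning device, its location at the time the target speech signal was collected and determine the geographic range from that location, which removes the need for manual input and increases automation. Using the geographic range, the terminal device can filter out language types with low recognition probability, which improves the accuracy of the language recognition algorithm.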
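One way to picture the region-based filtering is as a lookup that restricts the candidate language types before the most likely one is chosen. The sketch below is only illustrative: the table `REGION_LANGUAGES`, the function names, and the per-language scores are assumptions, and the text does not specify how the language recognition algorithm produces its scores.

```python
from typing import Dict, List

# Hypothetical region table; only the Guangdong entry comes from the text.
REGION_LANGUAGES: Dict[str, List[str]] = {
    "Guangdong": ["Cantonese", "Chaoshan dialect", "Hakka", "Mandarin"],
}


def restrict_candidates(scores: Dict[str, float], region: str) -> Dict[str, float]:
    """Keep only the language types spoken in the given region."""
    allowed = REGION_LANGUAGES.get(region)
    if allowed is None:
        return dict(scores)  # unknown region: no filtering is applied
    return {lang: s for lang, s in scores.items() if lang in allowed}


def pick_language(scores: Dict[str, float], region: str) -> str:
    """Return the highest-scoring language type after regional filtering."""
    candidates = restrict_candidates(scores, region)
    if not candidates:
        raise ValueError("no candidate language type left after filtering")
    return max(candidates, key=candidates.get)
```

For instance, with assumed scores of 0.40 for Cantonese, 0.45 for Shanghainese, and 0.15 for Mandarin, restricting to Guangdong removes Shanghainese and selects Cantonese, whereas an unrecognized region leaves Shanghainese as the top candidate.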
在一种可能的实现方式中,该终端设备具体可以为一语音识别服务器。该语音识别服务器可以接收各个用户终端发送的目标语音信号,并通过内置的语言识别算法,确定该目标语音信号的目标语言类型,并从数据库中提取与目标语言类型对应的实时语音识别模型识别目标语音信号对应的文本信息,将文本信息反馈给用户终端。
作为示例而非限定,图4示出了本申请一实施例提供的语音识别系统的结构示意图。参见图4所示,该语音识别系统包括有用户终端41以及语音识别服务器42。用户可以通过用户终端41采集所需识别的目标语音信号,终端设备41可以安装有与语音识别服务器42对应的客户端程序,通过客户端程序与语音识别服务器42建立通信连接,并将采集到的目标语音信号通过客户端程序发送给语音识别服务器42,语音识别服务器42由于采用的是实时语音识别模型,因此可以实时响应用户的语音识别请求,并将语音识别结果通过客户端程序反馈给用户终端41,用户终端41在接收到语音识别结果后,可以通过交互模块,例如显示器或触控屏等,将语音识别结果内的文本信息输出给用户,从而完成语音识别流程。
在一种可能的实现方式中,终端设备可以调用语音识别服务器提供的应用程序接口API,将所需识别的目标语言信号发送给语音识别服务器,通过语音识别服务器内 置的语言识别算法,确定该目标语音信号的目标语言类型,继而选取与目标语言类型对应的语音识别算法,输出目标语音信号的文本信息,并通过API接口将文本信息反馈给终端设备。
在S303中,将所述目标语言信号输入至与所述目标语言类型对应的语音识别模型,获得所述语音识别模型输出的文本信息;
其中,所述语音识别模型是通过训练样本集训练得到的,所述训练样本集包括多个扩展语音信号、每个扩展语音信号对应的扩展文本信息、每个扩展语音信号对应的原始语音信号以及每个原始语音信号对应的原始文本信息,所述扩展语音信号是基于基础语言类型的已有文本转换得到的。
在本实施例中,终端设备在确定了目标语音信号对应的目标语言类型后,可以获取与目标语言类型相对应的实时语音识别模型,其中,终端设备内置的存储器中可以存储有各个不同语言类型的实时语音识别模型,终端设备可以根据目标语言类型的类型编号,从存储器中选取对应的实时语音识别模型;终端设备还可以向云端服务器发送模型获取请求,该模型获取请求内携带有上述识别得到的目标语言类型的类型编号,云端服务器可以将与类型编号对应的实时语音识别模型反馈给终端设备。
在本实施例中,由于不同的语言类型样本的数量差别较多,特别对于基础语言类型,以汉语而言,基础语言类型即为普通话,由于使用的用户数量较多而且使用场合较多,因此可以采集得到的语音样本的数量较多,在对实时语音识别模型进行训练时,由于样本数量大,因此具有较好的训练效果,进而使得基础语言类型的实时语音识别模型的输出准确性较高。而对于非基础语言类型的语言类型,例如地方性方言,对于汉语而言,地方性方言则为区别与普通话而言的其他语言,例如粤语、潮汕话、上海话、北京话以及天津话等,相对于基础语言类型而言,由于上述地区性方言的使用用户的数量较少以及使用场合也较为局限,因此采集到地方性方言的语音信号的样本较少,从而训练的覆盖率较低,进而降低了非基础语言类型的实时语音识别模型的输出准确性。为了平衡不同语言种类的样本数量之间差异,从而提高非基础语言类型的实时语音识别模型的识别准确性,本申请实施例在对实时语音识别模型进行训练时所使用的训练集中,除了原始语音信号外,还包含有扩展语音信号。其中,原始语音信号指的是该信号对应的说话对象所使用的语言类型为目标语言类型,即基于目标语言类型下所说出的语音信号。而扩展语音信号并非真实采集得到的原始信号,而是将基础语言类型所对应的基础语言文本导入至预设的语音合成算法,所输出的合成语音信号。由于以基础语言类型所编辑的基础语言文本的数量较多,因此样本数较多,能够提高训练的覆盖率。例如,大部分中文书籍、通知以及网路文章等,均是基于普通话为阅读语言而撰写的,而以粤语或东北话等地方性方言为阅读语言的文本量则较少,因此,可以基于上述基础语言类型对应的基础语言文本,转换为扩展语言信号,以扩大对于非基础语言类型的样本数量。
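In a possible implementation, original speech signals may be obtained as follows: the terminal device downloads a corpus of the target language type from a plurality of preset cloud servers; the corpus stores a plurality of historical speech signals in the target language type. The terminal device collates all the historical speech signals and uses them as the original speech signals in the training set. The historical speech signals may be extracted from the audio data of video files. For example, if the tags of a movie file include a dubbing language that matches the target language type, the audio data in the movie file was recorded from speech in the target language type, so the original speech signals can be obtained from that audio data. Likewise, if other existing files carry a tag for the target language type, original speech signals can be extracted from them as well.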
在一种可能的实现方式中,生成扩展语音信号的方式可以为:终端设备可以通过语义识别算法对基础语言类型的已有文本进行语义分析,确定该已有文本内包含的文本关键词,并确定各个文本关键词在目标语言类型对应的关键词译名,并获取各个关键词译名对应的译名发音,基于所有关键词译名的译名发音生成上述的扩展文本。
作为示例而非限定,图5示出了本申请一实施例提供的语音识别系统的交互流程图。参见图5所示,该语音识别系统包括有用户终端以及语音识别服务器。该语音识别服务器包括有多个不同的模块,分别为语言类型识别模块以及对应不同语言类型的实时语音识别模块,其中,该实时语音识别模块内包含有基础语言类型的实时语音识别模块以及地方性方言的实时语音识别模块。用户终端采集得到用户的目标语音信号后,将其发送给语音识别服务,通过语音识别服务器内的语音类型识别模块确定该目标语音信号的目标语言类型,并传输给对应的实时语音识别模块进行语音识别,以输出对应的文本信息,并将输出的文本信息反馈给用户终端。
在本实施例中,终端设备可以通过原始语音信号以及通过基础语言类型的已有文本转换得到的扩展语音信号对原生语音识别模型进行训练,当原生语音识别模型的识别结果收敛且对应的损失函数小于预设的损失阈值,则识别该原生语音识别模型已调整完毕,此时可以将调整后的原生语音识别模型作为上述的实时语音识别模型,以响应发起的语音识别操作。
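With the spread of smart mobile devices, speech recognition (Automatic Speech Recognition, ASR) has come into large-scale use as a new mode of human-computer interaction. Many application scenarios build services on speech recognition, such as intelligent voice assistants, voice input methods, and transcription systems. In recent years, the development of deep learning has greatly improved the recognition accuracy of ASR, and most current ASR systems are built on deep learning models. Deep learning models, however, depend on massive amounts of data, that is, training corpora, to improve recognition accuracy. Training corpora are mainly produced by manual annotation, whose very high labor cost hinders the development of ASR. Besides active annotation, an ASR model also collects a large amount of user data while in use; if this data could be labeled automatically, the training corpus could be expanded on a large scale, improving recognition accuracy. When serving a massive user base, users speak different language types, so the ASR model must adapt to different language types through self-learning in order to achieve high recognition accuracy for all of them. Because there are few user samples for regional dialects, the training corpora for some dialects are insufficient, which lowers the recognition rate for those dialects. In existing real-time speech recognition models, the sample counts across dialects are severely imbalanced: base-language samples make up the majority and samples of some dialects are scarce, so it is difficult to improve the recognition rate for dialects in a targeted way. In real-time speech recognition, the volume of user data is too large to annotate manually in full, while automatic machine annotation may introduce errors; these errors cause the model to drift during self-learning and degrade its performance.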
在一种可能的实现方式中,根据语音信号采集的地域信息,来配置不同的实时语 音识别模型,从而能够基于省份或市区等行政地域划分规则,对实时语音识别模型进行训练,以实现有针对性地模型训练。然而上述方式依赖省份区分口音无法做到对口音的精细建模,由于部分省份方言差异非常大,同一省份内的方言有完全不同的发音方式甚至短语,无法保证省份的口音一致性,导致实时语音训练的粒度较大,降低了识别的准确性;另一方面,某些方言具有较多的人群,如粤语、上海话,上述人群可以分布于多个不同省份,从而导致了上述无法针对特定的方言进行针对性优化,降低了识别的准确性。
与上述实现方式不同的是,本实施例提供的方式可以利用基础语言类型的样本数量庞大、覆盖性高的特点,将基础语言类型的已有文本转换为目标语言类型的扩展语言信号,由于上述转换方式是定向转换的,因此生成的扩展语言信号必然是基于目标语言类型的语音信号,从而无需用户进行手动标记,减少了人力成本的同时,也能够为地方性方言提供大量的训练语料,从而实现了不同语言类型的样本均衡,提高了训练操作的准确性。
以上可以看出,本申请实施例提供的一种语音识别的方法通过样本数量较大的基础语言文本转换为扩展语音信号,并通过目标语言类型对应的原始语音信号以及扩展语音信号对目标语言类型对应的实时语音识别模型进行训练,并通过训练后的实时语音识别模型对目标语音信号进行语音识别,输出文本信息,从而能够增加训练非基础语言的实时语音识别模型训练所需的样本个数,从而提高了语音识别的准确性以及适用性。
图6示出了本申请第二实施例提供的一种语音识别的方法的具体实现流程图。参见图6,相对于图3所述实施例,本实施例提供的一种语音识别的方法中在所述将所述目标语言信号输入至与所述目标语言类型对应的语音识别模型,获得所述语音识别模型输出的文本信息之前,还包括:S601~S603,具体详述如下:
进一步地,在所述将所述目标语言信号输入至与所述目标语言类型对应的语音识别模型,获得所述语音识别模型输出的文本信息之前,还包括:
在S601中,获取所述基础语言类型对应的已有文本。
在本实施例中,由于基础语言类型的使用范围广且使用人群较多,因此互联网以及云端数据库内存储有的基于基础语言类型作为记载语言的文本数量较多,终端设备可以从云端数据库的文本库内提取基础语言类型的已有文本,还可以从互联网中进行数据爬取,获取多个记载语言使用的是基础语言类型的文本,以获取得到上述的已有文本。
在一种可能的实现方式中,终端设备在响应用户发起的语音识别操作时,获取到用户发送的历史语音信号,若检测到该历史语音信号对应的语言类型为基础语言类型,则可以将该历史语音信号生成的历史文本,作为上述基于基础语言类型所记载的已有文本,从而实现自采集训练数据的目的,提高了训练样本数,继而提高了实时语音识别模型的识别准确性。
在一种可能的实现方式中,不同的目标语言类型所对应的基础语言类型不同,终端设备可以建立基础语言对应关系,确定不同的目标语言类型所关联的基础语言类型。需要说明的是,一个目标语言类型对应一个基础语言类型,而一个基础语言类型可以 对应多个目标语言类型。例如,对于汉语语种而言,其基础语言类型为普通话,则属于汉语语种这一大类的所有语言类型,其对应的基础语言类型为普通话;而对于英语语种而言,其基础语言类型为英式英语,则属于英语语种这一大类的所有语言类型,其对应的基础语言类型为英式英语,从而可以确定不同的语言类型与基础语言类型之间的对应关系。终端设备可以根据上述建立的基础语言对应关系,确定目标语言类型对应的基础语言类型,并获取该基础语言类型的已有文本。
在S602中,将所述已有文本转换成所述目标语言类型对应的扩展语音文本。
在本实施例中,终端设备可以根据基础语言类型以及目标语言类型,确定两者之间的翻译算法,并将已有文本导入到翻译算法内,生成扩展语音文本。由于已有文本是基于基础语言类型记载的,里面词汇以及语法是根据基础语言类型所确定的,而不同的语言类型,所使用的词汇以及语法会存在差异,为了能够提高后续扩展语音信号的准确性,终端设备并非直接根据已有文本生成对应的合成语音,而是首先对已有文本进行翻译,从而能够生成符合目标语言类型的语法结构以及用词规范的扩展语音文本,以提高后续识别的准确性。
在一种可能的实现方式中,终端设备在转换得到扩展语音文本后,可以对上述翻译的正确性进行校验。终端设备可以通过语义分析算法,确定已有文本内包含的各个实体,并获取各个实体在目标语言类型对应的译名;检测各个译名是否在转换后的扩展语音文本中,若各个译名均在扩展语音文本内,则识别各个译名的之间的相互位置关系,基于相互位置关系确定译名之间是否符合目标语言类型的语法结构,若所述相互位置关系均满足所述语法结构,则识别翻译正确;反之,若所述相互位置关系不满足所述语法结构和/或,所述译名不包含在所述扩展语音文本内,则识别翻译失败,需要重新调整上述翻译算法。
在S603中,生成所述扩展语音文本对应的所述扩展语音信号。
在本实施例中,终端设备可以通过语音合成算法,获取扩展语音文本内各个字符对应的标准读音,并通过语义识别算法,确定扩展语音文本内包含的词组,确定每个词组之间的词间间隔时长以及词组内不同字符之间的字符间隔时长,根据词间间隔时长、字符间隔时长以及各个字符对应的标准读音,生成扩展语音文本对应的扩展语音信号,从而生成了以目标语言类型为会话语言的扩展语音信号。
在一种可能的实现方式中,终端设备可以为不同的目标语言类型建立对应的语料库。每个语料库记录有该目标语言类型的多个基础发音。终端设备在获取得到目标语言类型对应的字符后,可以确定该字符内包含的基础发音,基于多个基础发音进行合并以及变换,得到该字符对应的标准发音,从而可以基于每个字符对应的标准发音,生成扩展语音信号。
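By way of example and not limitation, FIG. 7 shows a schematic diagram of the conversion of an extended speech text provided by an embodiment of this application. The terminal device obtains an existing text, "我这里没有你想要的" ("I don't have what you want here"), whose base language type is Mandarin, and the target language type is Cantonese. Using a translation algorithm between Mandarin and Cantonese, the terminal device translates the existing text into a Cantonese extended speech text, "我呢度冇你想要嘅", and imports this extended speech text into a Cantonese speech synthesis algorithm to obtain the corresponding extended speech signal, which expresses the meaning of "我这里没有你想要的". This achieves the goal of sample expansion.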
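The text-conversion step of this example can be mimicked with a toy phrase table. This is only a sketch under strong assumptions: the application does not describe its translation algorithm, and the three-entry lexicon `MANDARIN_TO_CANTONESE` and the function `to_extended_text` are invented here so that the example sentence can be reproduced. A plain substitution such as "的" to "嘅" would not be correct in general.

```python
from typing import List, Tuple

# Hypothetical phrase table covering only the example sentence.
MANDARIN_TO_CANTONESE: List[Tuple[str, str]] = [
    ("这里", "呢度"),
    ("没有", "冇"),
    ("的", "嘅"),
]


def to_extended_text(existing_text: str,
                     lexicon: List[Tuple[str, str]] = MANDARIN_TO_CANTONESE) -> str:
    """Rewrite a base-language text with target-language wording."""
    for source, target in lexicon:
        existing_text = existing_text.replace(source, target)
    return existing_text


extended_text = to_extended_text("我这里没有你想要的")
# extended_text would then be passed to a speech synthesis step (not shown)
# to produce the extended speech signal.
```

The resulting extended text keeps the label of the original sentence for free, since the target-language wording is generated rather than transcribed.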
在本申请实施例中,通过获取基础语言类型对应的已有文本,并对已有文本进行转换得到扩展语音文本,能够实现对样本数较少的非基础语言的样本扩充,提高了实时语音识别模型的训练效果,从而提高了识别准确性。
图8示出了本申请第三实施例提供的一种语音识别的方法的具体实现流程图。参见图8,相对于图3所述实施例,本实施例提供的一种语音识别的方法中在所述将所述目标语言信号输入至与所述目标语言类型对应的语音识别模型,获得所述语音识别模型输出的文本信息之前,还包括:S801~S803,具体详述如下:
进一步地,在所述将所述目标语言信号输入至与所述目标语言类型对应的语音识别模型,获得所述语音识别模型输出的文本信息之前,还包括:
在S801中,通过所述训练集中的所述原始语音信号以及与所述原始语音信号对应的原始语言文本,对第一原生语音模型进行训练,得到异步语音识别模型。
在本实施例中,终端设备可以配置有两种不同的语音识别模型,分别为能够响应实时语音识别操作的实时语音识别模型以及需要较长响应时间的异步语音识别模型。其中,实时语音识别模型可以是基于神经网络搭建的,搭建上述实时语音识别模型的神经网络的网络层级较少,从而具有较快的响应效率,但同时,识别的准确率低于异步语音识别模型;而异步语音识别模型也可以是基于神经网络搭建的,搭建上述异步语音识别模型的神经网络的网络层级较多,从而识别所需时长较长,响应效率较低,但同时,识别的准确率高于实时语音识别模型。在该情况下,通过异步语音识别模型来对实时语音识别模型的训练过程进行数据纠偏,从而提高实时语音识别模型的准确性。
在一种可能的实现方式中,实时语音识别模型与异步语音识别模型可以是基于同类结构的神经网络搭建的,也可以是不同类结构的神经网络搭建的,在此不做限定。因此,用于构建实时语音识别模型的第二原生语音模型以及用于构建异步语音识别模型的第一原生语音模型之间,也可以是基于同类结构的神经网络搭建的,也可以是不同类结构的神经网络搭建的,在此不做限定。
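In this embodiment, the asynchronous speech recognition model has better recognition accuracy and a longer convergence time, so training remains effective even with few samples. The original speech signals are speech signals that have not undergone conversion; the pronunciation of each character in them varies from user to user, so they provide high coverage for the test process, and because users' pronunciation deviates from the standard pronunciation, they also allow the subsequent training process to be corrected. For these reasons, the terminal device uses the original speech signals and the corresponding original language texts as training samples to train the first native speech model, takes the network parameters at which the training result has converged and the loss of the model is below a preset loss threshold as the trained network parameters, and configures the first native speech model with them to obtain the asynchronous speech recognition model. The loss of the first native speech model may be computed with the Connectionist Temporal Classification loss (CTC Loss), which can be expressed as:
$$\mathrm{Loss}_{ctc} = -\sum_{(x,z)\in S} \ln p(z \mid x)$$
where $\mathrm{Loss}_{ctc}$ is the loss function; x is an original speech signal; z is the original language text corresponding to the original speech signal; S is the training set formed by all original speech signals; and p(z|x) is the probability of outputting the original language text given the original speech signal.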
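To make the CTC loss concrete, the following sketch computes p(z|x) by brute force for a tiny example: it enumerates every frame-level path, collapses repeated symbols and blanks, and sums the probability of the paths that collapse to z. It assumes that the model output for x is available as one probability distribution per frame and that `_` is the blank symbol; both are illustrative choices, and practical systems use a dynamic-programming forward algorithm instead of enumeration.

```python
import itertools
import math
from typing import Dict, List, Sequence, Tuple

BLANK = "_"  # assumed blank symbol


def collapse(path: Sequence[str]) -> Tuple[str, ...]:
    """Merge repeated symbols, then drop blanks (the CTC collapsing rule)."""
    out = []
    prev = None
    for symbol in path:
        if symbol != prev and symbol != BLANK:
            out.append(symbol)
        prev = symbol
    return tuple(out)


def ctc_probability(frame_probs: List[Dict[str, float]],
                    target: Sequence[str]) -> float:
    """p(z|x): total probability of all paths that collapse to the target."""
    symbols = list(frame_probs[0].keys())
    total = 0.0
    for path in itertools.product(symbols, repeat=len(frame_probs)):
        if collapse(path) == tuple(target):
            p = 1.0
            for t, symbol in enumerate(path):
                p *= frame_probs[t][symbol]
            total += p
    return total


def ctc_loss(samples: List[Tuple[List[Dict[str, float]], Sequence[str]]]) -> float:
    """Loss_ctc = -sum over (x, z) in S of ln p(z|x)."""
    return -sum(math.log(ctc_probability(x, z)) for x, z in samples)
```

For two frames that each assign probability 0.5 to the blank and 0.5 to "a", three of the four paths collapse to "a", so p(z|x) is 0.75 and the loss for that single sample is -ln 0.75.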
进一步地,作为本申请的另一实施例,所述异步语音识别模型内的第一网络层级多于所述实时语音识别模型内的第二网络层级。
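In this embodiment, the two speech recognition models are built on neural networks of the same type of structure, and the asynchronous speech recognition model contains more first network layers than the real-time speech recognition model contains second network layers. The asynchronous model therefore has better recognition accuracy but takes longer to perform recognition, which makes it suitable for non-real-time, asynchronous response scenarios. For example, different users may send the audio files that need speech recognition to the terminal device, and the terminal device imports these audio files into the asynchronous speech recognition model. The user terminal and the terminal device may configure the communication link as a long-lived connection and check the running status of the asynchronous speech recognition model at a preset time interval. While the long-lived connection is held, the overhead of maintaining the link between the user terminal and the terminal device is small, which reduces the resource usage of the terminal device's interface. When the asynchronous speech recognition model is detected to have output the recognition result for an audio file, the result is sent to the user terminal over the long-lived connection; at that point the network resources allocated to the connection can be adjusted dynamically to speed up delivery of the speech recognition result. In this case, the asynchronous speech recognition model may add each speech recognition task to a preset task list, process the tasks in the order in which they were added, and send each recognition result to the corresponding user terminal. The real-time speech recognition model, by contrast, responds in real time to speech recognition requests sent by users. In this case, a real-time transmission link may be established between the user terminal and the terminal device. While collecting the speech signal, the user terminal transmits the corresponding audio stream to the terminal device in real time, and the terminal device imports the audio stream into the real-time speech recognition model; that is, while the user terminal is still collecting the user's speech signal, the real-time speech recognition model can already recognize the audio frames received so far. After collection is complete, the user terminal sends the complete audio stream to the terminal device, and the terminal device passes the remaining audio frames that were received later and not yet recognized to the real-time speech recognition model, which generates the speech recognition result, that is, the text information, and returns it to the user terminal. This achieves real-time response to the speech recognition requests initiated by users.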
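By way of example and not limitation, FIG. 9 shows a schematic structural diagram of an asynchronous speech recognition model and a real-time speech recognition model provided by an embodiment of this application. As shown in FIG. 9, the two models are neural networks with the same network structure, consisting of a frequency feature extraction layer, convolutional layers (CNN), recurrent neural network layers (Bi-RNN), and a fully connected layer. Both models have the same number of frequency feature extraction layers and fully connected layers, one of each. The frequency feature extraction layer extracts spectral feature values from the speech spectrum converted from the audio stream to obtain a frequency feature matrix. The fully connected layer takes the feature vectors finally output by the preceding layers, determines multiple pronunciation probabilities for each audio frame, generates the pronunciation probability matrix, and outputs the text information corresponding to the speech signal based on that matrix. The real-time speech recognition model contains two convolutional layers and four recurrent neural network layers; the asynchronous speech recognition model contains three convolutional layers and nine recurrent neural network layers. More convolutional and recurrent layers give better feature extraction and therefore higher recognition accuracy, but the more network layers there are, the longer the computation takes. The real-time speech recognition model must balance recognition accuracy against response time, so it is configured with fewer network layers than the asynchronous speech recognition model.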
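The layer counts given for FIG. 9 can be captured in a small configuration record, which makes the size difference between the two models explicit. The class `ModelConfig` and its field names are assumptions for illustration; only the layer counts themselves are taken from the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    """Layer counts of one speech recognition model (hypothetical record)."""
    feature_layers: int  # frequency feature extraction layers
    conv_layers: int     # convolutional (CNN) layers
    birnn_layers: int    # recurrent (Bi-RNN) layers
    fc_layers: int       # fully connected layers

    @property
    def total_layers(self) -> int:
        return (self.feature_layers + self.conv_layers
                + self.birnn_layers + self.fc_layers)


# Same structure, different depth.
REALTIME = ModelConfig(feature_layers=1, conv_layers=2, birnn_layers=4, fc_layers=1)
ASYNC = ModelConfig(feature_layers=1, conv_layers=3, birnn_layers=9, fc_layers=1)
```

Keeping both configurations in the same record type reflects that the two models share a structure and differ only in how many convolutional and recurrent layers they stack.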
在本申请实施例中,通过配置更多网络层级于异步语音识别模型内,能够提高异步语音识别模型的识别准确性,从而能够对后续实时语音识别模型的训练过程进行监 督以及纠正,从而能够提高实时语音识别模型的识别准确性。
在S802中,基于所述异步语音识别模型,输出所述扩展语音信号对应的发音概率矩阵。
在本实施例中,终端设备配置了异步语音识别模型后,可以分别将各个扩展语音信号导入到上述的异步语音识别模型,生成各个扩展语音信号对应的发音概率矩阵。由于扩展语音信号具体由不同的语音帧构成,不同的语音帧对应一个发音,而由于语音识别模型最后的全连接层,是用于输出不同发音的概率值,因此每个语音帧可以对应多个不同的候选发音,不同候选发音对应不同的概率值,然后可以根据各个发音对应的字符之前的上下文关联度以及每个字符的概率值,最后生成对应的文本信息。基于此,不同语音帧可以对应多个不同发音,而不同的发音对应不同的概率值。将各个语音帧对应的候选语音进行整合,则可以生成一个发音概率矩阵。
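By way of example and not limitation, Table 1 shows a pronunciation probability matrix provided by an embodiment of this application. As shown in Table 1, the extended speech signal contains four speech frames, T1 to T4, and each speech frame may represent one character. After recognition by the asynchronous speech recognition model, the first speech frame T1 corresponds to four candidate pronunciations, "xiao", "xing", "liao", and "liang", with probability values of 61%, 15%, 21%, and 3% respectively. Likewise, each subsequent speech frame has several candidate pronunciations, each with its own pronunciation probability.
| T1 | T2 | T3 | T4 |
| Xiao 61% | Ye 11% | Liao 22% | Yi 70% |
| Xing 15% | Yi 54% | Xing 19% | Ye 9% |
| Liao 21% | Yan 8% | Xiao 49% | Ya 21% |
| Liang 3% | Ya 14% | Liang 10% | |
| | Yin 13% | | |
Table 1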
在S803中,根据所述发音概率矩阵以及所述扩展语音信号,对第二原生语音模型进行训练,得到所述实时语音识别模型。
在本实施例中,终端设备可以联合异步语音识别模型以及已有的训练样本对第二原生语音模型进行训练,以得到实时语音识别模型,从而能够提高实时语音识别模型的识别准确性。其中,异步语音识别模型具体的作用是对第二原生语音模型的训练过程进行监督以及预测纠正,从而提高第二原生语音模型的训练效率以及准确性,以得到实时语音识别模型。
需要说明的是,由于通过配置训练集对模型进行训练的过程中,训练集内每个输入只对应一个标准输出结果,特别对于语音识别过程,由于用户的语音语调或者采集过程中存在噪声,同样字符不同用户或不同场景下发音之间差异较大,因此识别得到的输出结果可能存在多个候选发音,若只是对应一个标准输出结果,并根据标准输出结果进行训练,则无法确定语音预测的方向是否准确,从而降低了训练的准确性。为了解决上述问题,本申请引入了异步语音识别模型对实时语音识别模型的语音预测方向进行纠偏,通过配置有多个不同候选发音的发音概率矩阵,基于上述的发音概率矩阵对实时语音识别模型进行训练,由于异步语音识别模型具有更高的准确性以及可靠 性,从而能够保证实时语音识别模型的语音预测方向与异步语音识别模型的语音识别方向一致,从而提高了实时语音识别模型的准确性。
在一种可能的实现方式中,训练第二原生语音模型的过程具体可以为:将扩展语音信号导入到上述的第二原生语音模型,并生成对应的预测发音矩阵。通过发音概率矩阵以及预测发音矩阵,确定差异的候选发音以及相同候选发音之间的偏差值,计算两个矩阵之间的偏差率,基于所有偏差率确定第二原生语音模型的损失量,从而基于损失量对第二原生语音模型进行调整。其中,损失量的计算函数仍可以采用CTC Loss函数进行计算,具体函数公式可以参照上述论述,在此不再赘述,其中,函数中的z为上述的发音概率矩阵,p(z|x)为输出上述发音概率矩阵的概率值。
在本申请实施例中,通过对异步语音识别模型进行训练,并基于异步语音识别模型监督实时语音识别模型的训练过程,从而提高训练效果,实现了语音识别的纠偏,提高了实时语音识别模型的准确性。
图10示出了本申请第四实施例提供的一种语音识别的方法S803的具体实现流程图。参见图10,相对于图8所述实施例,本实施例提供的一种语音识别的方法中S803包括:S1001~S1002,具体详述如下:
进一步地,所述根据所述发音概率矩阵以及所述扩展语音信号,对第二原生语音模型进行训练,得到所述实时语音识别模型,包括
在S1001中,根据发音概率矩阵以及所述扩展语音信号,对所述第二原生语音模型进行粗粒度训练,得到准实时语音模型。
在本实施例中,对于第二原生语音模型的训练过程分为两个部分,一个是粗粒度训练过程,另一个是细粒度训练过程。其中,粗粒度训练过程则具体通过异步语音识别模型生成的发音概率矩阵进行语音纠错以及监督。在该情况下,终端设备可以将将扩展语音信号作为第二原生语音模型的训练输入,将发音概率矩阵作为第二原生语音模型的训练输出,对第二原生语音模型进行模型训练,直到第二原生语音模型的结果收敛,且对应的损失函数小于预设的损失阈值,此时识别第二原生语音模型训练完成,将训练完成的第二原生语音模型识别为准实时语音模型,以便执行下一步的细粒度训练操作。
在一种可能的实现方式中,对第二原生语音模型进行粗粒度训练的过程具体可以为:将上述扩展语音信号划分为多个训练组,该训练组包含一定数量的扩展语音信号以及与该扩展语音信号关联的发音概率矩阵。终端设备分别通过各个训练组对上述的第二原生语音模型进行训练,并在训练后,通过预设的原始语音信号作为验证集导入到每次训练后的第二原生语音模型,计算关于各个验证集的偏差率,终端设备将偏差率最小时所述第二原生语音模型的网络参量,作为训练完成的网络参量,并基于训练完成的网络参量导入到第二原生语音模型,从而得到上述的准实时语音模型。
在S1002中,根据所述原始语音信号以及所述原始语言文本,对所述准实时语音模型进行细粒度训练,得到所述实时语音识别模型。
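In this embodiment, after generating the quasi-real-time speech recognition model, the terminal device may perform a second round of training, that is, the fine-grained training described above. The training data used for fine-grained training are the original speech signals and the original language texts corresponding to them. The original speech signals are speech signals that have not undergone conversion; the pronunciation of each character in them varies from user to user, so they provide high coverage for the test process, and because users' pronunciation deviates from the standard pronunciation, they also allow the subsequent training process to be corrected. For these reasons, the terminal device uses the original speech signals and the corresponding original language texts as training samples to train the quasi-real-time speech model, takes the network parameters at which the training result has converged and the loss of the model is below a preset loss threshold as the trained network parameters, and configures the quasi-real-time speech model with them to obtain the real-time speech recognition model. The loss of the quasi-real-time speech model may be computed with the Connectionist Temporal Classification loss (CTC Loss), which can be expressed as:
$$\mathrm{Loss}_{ctc} = -\sum_{(x,z)\in S} \ln p(z \mid x)$$
where $\mathrm{Loss}_{ctc}$ is the loss function; x is an original speech signal; z is the original language text corresponding to the original speech signal; S is the training set formed by all original speech signals; and p(z|x) is the probability of outputting the original language text given the original speech signal.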
在本申请实施例中,通过两个阶段对第二原生语音模型进行训练,从而生成实时语音识别模型,通过扩展语音信息扩展了训练样本以及采用异步语音识别模型对训练过程进行纠偏,提高了训练的准确性。
图11示出了本申请第五实施例提供的一种语音识别的方法S1001的具体实现流程图。参见图11,相对于图10所述实施例,本实施例提供的一种语音识别的方法中S1001包括:S1101~S1103,具体详述如下:
进一步地,所述根据发音概率矩阵以及所述扩展语音文本,对所述第二原生语音模型进行粗粒度训练,得到准实时语音模型,包括:
在S1101中,将所述扩展语音信号导入所述第二原生语音模型,确定所述扩展语音信号对应的预测概率矩阵。
在本实施例中,终端设备可以将扩展语音信号作为训练的输入量,导入到第二原生语音模型,该原生第二原生语音模型可以确定扩展语音信号内各个语音帧对应的候选发音,以及各个候选发音的判定概率,将所有语音帧对应的候选发音以及关联的判定概率生成预测概率矩阵。其中,预测概率矩阵的结构与发音概率矩阵的结构一致,具体描述可以参见上述实施例的描述。在此不做赘述。
在S1102中,所述发音概率矩阵以及所述预测概率矩阵导入预设的损失函数,计算所述第二原生语音模型的损失量。
在本实施例中,每个扩展语音信号对应两个概率矩阵,分别为基于第二原生语音识别模型输出的预测概率矩阵,以及基于异步语音识别模型输出的发音概率矩阵,终端设备可以将所有扩展语音信号对应的两个概率矩阵导入到预设的损失函数内,计算第二原生语音模型的损失量。若预测概率矩阵内各个候选发音以及对应概率值与发音概率矩阵的匹配程度越高,则对应的损失量的数值越少,从而可以根据损失量确定第二原生语音识别模型的识别准确性。
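Further, as another embodiment of this application, the loss function is specifically:
$$\mathrm{Loss}_{top\_k} = -\sum_{t=1}^{T}\sum_{c=1}^{C} \hat{y}_c^t \,\ln p_c^t, \qquad \hat{y}_c^t = \begin{cases} y_c^t, & r_c^t \le K \\ 0, & r_c^t > K \end{cases}$$
where $\mathrm{Loss}_{top\_k}$ is the loss; $p_c^t$ is the probability value of the c-th pronunciation in the t-th frame of the extended speech signal in the predicted probability matrix; $\hat{y}_c^t$ is the probability value of the c-th pronunciation in the t-th frame of the extended speech signal in the pronunciation probability matrix after processing by the optimization algorithm; T is the total number of frames; C is the total number of pronunciations recognized in the t-th frame; $y_c^t$ is the probability value of the c-th pronunciation in the t-th frame of the extended speech signal in the pronunciation probability matrix; $r_c^t$ is the rank of the c-th pronunciation after all pronunciations of the t-th frame of the extended speech signal in the pronunciation probability matrix are sorted by probability value in descending order; and K is a preset parameter.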
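In this embodiment, the loss function above is used to train the second native speech model to learn the K pronunciations with the highest probability values output by the asynchronous speech recognition model; pronunciations with lower probability values need not be learned. For the K pronunciations with the highest probability values, the probability value is therefore kept unchanged, that is, $\hat{y}_c^t = y_c^t$; for all other pronunciations, the optimized probability value is 0, so nothing is learned from them. This corrects the speech recognition of the second native speech model while keeping the correction efficient, because invalid low-probability pronunciation predictions are not learned.
By way of example and not limitation, Table 2 shows a pronunciation probability matrix after processing by the optimization algorithm provided in this application. The pronunciation probability matrix before optimization is shown in Table 1, in which the pronunciations are not sorted by probability value. The K value configured in the optimization algorithm is 2, so the second native speech model learns to predict the two pronunciations with the highest probability values in each frame. Here, $y_1^1$ denotes the probability value of the first pronunciation in the first frame, that is, the pronunciation probability of "xiao", which is 61%. Because this is the largest probability value among all pronunciations in the first frame, its rank is 1, that is, $r_1^1 = 1 \le K$, so this pronunciation probability is learned: $\hat{y}_1^1$ is the same as $y_1^1$, namely 61%. $y_2^1$ denotes the probability value of the second pronunciation in the first frame, that is, the pronunciation probability of "xing", which is 15%. When the probability values of all pronunciations in the first frame are sorted in descending order, this value is third, that is, $r_2^1 = 3 > K$, so this pronunciation probability is not learned: $\hat{y}_2^1$ differs from $y_2^1$ and is 0. Proceeding in the same way yields the pronunciation probability matrix after processing by the optimization algorithm.
| T1 | T2 | T3 | T4 |
| Xiao 61% | Ye 0% | Liao 22% | Yi 70% |
| Xing 0% | Yi 54% | Xing 0% | Ye 0% |
| Liao 21% | Yan 0% | Xiao 49% | Ya 21% |
| Liang 0% | Ya 14% | Liang 0% | |
| | Yin 0% | | |
Table 2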
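The Top-K processing of the pronunciation probability matrix can be sketched in a few lines. The example assumes that the loss is a cross-entropy between the Top-K-masked probabilities of the asynchronous model (the teacher) and the predictions of the second native speech model (the student), which is what the name "Top-K CE Loss" suggests; the function names, the dictionary representation of each frame, and the assumption that the student scores the same candidates as the teacher are all illustrative.

```python
import math
from typing import Dict, List

Frame = Dict[str, float]  # candidate pronunciation -> probability


def top_k_targets(teacher_frame: Frame, k: int) -> Frame:
    """Keep the k most probable pronunciations and set the rest to 0."""
    ranked = sorted(teacher_frame, key=teacher_frame.get, reverse=True)
    keep = set(ranked[:k])
    return {c: (p if c in keep else 0.0) for c, p in teacher_frame.items()}


def top_k_ce_loss(teacher: List[Frame], student: List[Frame], k: int) -> float:
    """Cross-entropy of student predictions against Top-K teacher targets."""
    loss = 0.0
    for y_t, p_t in zip(teacher, student):
        for c, y_hat in top_k_targets(y_t, k).items():
            if y_hat > 0.0:  # zeroed pronunciations contribute nothing
                loss -= y_hat * math.log(p_t[c])
    return loss


# Pronunciation probability matrix before optimization (values from Table 1).
TABLE_1: List[Frame] = [
    {"xiao": 0.61, "xing": 0.15, "liao": 0.21, "liang": 0.03},
    {"ye": 0.11, "yi": 0.54, "yan": 0.08, "ya": 0.14, "yin": 0.13},
    {"liao": 0.22, "xing": 0.19, "xiao": 0.49, "liang": 0.10},
    {"yi": 0.70, "ye": 0.09, "ya": 0.21},
]
```

With K set to 2, the first frame keeps only "xiao" and "liao", matching the worked example in the text. Because zeroed entries never enter the sum, only the K retained values per frame need to be stored, which is the compression effect of the Top-K approach.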
在本申请实施例中,通过采用Top-K的方式确定损失函数,从而能够对概率较高的发音预测进行学习,兼顾了训练准确性的同时,能够提高收敛速度,从而提高了训练效果;并且还达到对异步语言识别模型输出的发音概率矩阵压缩的目的,节省存储 空间。
在S1103中,基于所述损失量调整所述第二原生语音模型内的网络参量,得到所述准实时语音识别模型。
在本实施例中,终端设备可以根据上述的损失量对第二原生语音模型进行调整,在上述损失量小于预设的损失阈值且结果收敛时对应的网络参量,作为训练完成的网络参量,基于训练完成的网络参量配置第二原生语音模型,得到准实时语音识别模型。
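By way of example and not limitation, FIG. 12 shows a schematic diagram of the training process of the real-time speech model provided by an embodiment of this application. As shown in FIG. 12, training has three stages: pre-training, coarse-grained training, and fine-grained training. In the pre-training stage, the asynchronous speech model is trained on the original speech signals and the original language texts, and the loss function may be the CTC Loss. In the coarse-grained training stage, the trained asynchronous speech model outputs the pronunciation probability matrix of each extended speech signal, and the quasi-real-time speech model is trained on the pronunciation probability matrices and the extended speech signals, where the loss function may be the Top-K CE Loss. In the fine-grained training stage, the real-time speech model is trained on the original speech signals and the original language texts, and the loss function may be the CTC Loss.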
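The data flow between the three stages can be outlined as a function that takes each stage as a callable. This is a structural sketch only: `run_three_stage_training` and its parameters `pretrain`, `predict_matrix`, `coarse_train`, and `fine_train` are hypothetical names, and the actual training routines are not specified here.

```python
from typing import Any, Callable, List, Sequence, Tuple


def run_three_stage_training(
    original: Sequence[Tuple[Any, Any]],   # (original speech signal, original text)
    extended: Sequence[Any],               # extended speech signals
    k: int,                                # Top-K parameter for the coarse stage
    pretrain: Callable[[Sequence[Tuple[Any, Any]]], Any],
    predict_matrix: Callable[[Any, Any], Any],
    coarse_train: Callable[[List[Tuple[Any, Any]], int], Any],
    fine_train: Callable[[Any, Sequence[Tuple[Any, Any]]], Any],
) -> Any:
    """Chain pre-training, coarse-grained and fine-grained training."""
    # Stage 1: train the asynchronous model on original data (CTC loss).
    async_model = pretrain(original)
    # Stage 2: label each extended signal with a pronunciation probability
    # matrix, then train the quasi-real-time model on those soft labels.
    matrices = [predict_matrix(async_model, signal) for signal in extended]
    quasi_model = coarse_train(list(zip(extended, matrices)), k)
    # Stage 3: fine-tune on original data to get the real-time model.
    return fine_train(quasi_model, original)
```

Passing the stages in as callables keeps the sketch independent of any particular training framework; the point is only that extended signals are used in the coarse stage and original signals in the first and last stages.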
在本申请实施例中,通过计算两个概率矩阵之间的偏差值,确定第二原生语音模型与异步语音识别模型之间的识别损失量,从而能够实现基于异步语义识别模型对第二原生语音模型进行纠偏的目的,提高了训练的准确性。
图13示出了本申请第六实施例提供的一种语音识别的方法S303的具体实现流程图。参见图13,相对于图3、图6、图8、图10以及图11任一所述实施例,本实施例提供的一种语音识别的方法S303包括:S1301~S1303,具体详述如下:
进一步地,所述将所述目标语言信号输入至与所述目标语言类型对应的语音识别模型,获得所述语音识别模型输出的文本信息,包括:
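In S1301, the target speech signal is divided into a plurality of audio frames.
In this embodiment, a speech signal may be composed of a plurality of audio frames, each with a preset frame length and separated by a certain frame interval; arranging the audio frames according to the frame interval yields the complete audio stream. The terminal device divides the target speech signal according to the preset frame interval and frame length to obtain a plurality of audio frames. Each audio frame may correspond to the pronunciation of one character.
In S1302, a discrete Fourier transform is performed on each audio frame to obtain the speech spectrum corresponding to each audio frame.
In this embodiment, the terminal device uses the discrete Fourier transform to convert from the time domain to the frequency domain and obtain the speech spectrum of each audio frame. The pronunciation frequency of each pronunciation can be determined from the speech spectrum, and the character corresponding to each pronunciation can then be determined from its pronunciation frequency.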
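Steps S1301 and S1302 can be sketched with the framing values mentioned earlier in this document (16 kHz sampling rate, 25 ms frame length, 10 ms frame interval) as defaults. The function names are assumptions, and the transform below is a direct, unoptimized evaluation of the DFT definition written for readability; a real system would use a fast Fourier transform and typically apply a window function first.

```python
import cmath
import math
from typing import List, Sequence


def split_into_frames(samples: Sequence[float],
                      sample_rate: int = 16000,
                      frame_ms: int = 25,
                      hop_ms: int = 10) -> List[Sequence[float]]:
    """Cut the signal into overlapping frames of frame_ms every hop_ms."""
    frame_len = sample_rate * frame_ms // 1000  # 400 samples at 16 kHz
    hop = sample_rate * hop_ms // 1000          # 160 samples at 16 kHz
    return [samples[start:start + frame_len]
            for start in range(0, len(samples) - frame_len + 1, hop)]


def dft_magnitudes(frame: Sequence[float]) -> List[float]:
    """Magnitude spectrum of one frame for bins 0 .. n // 2 (naive DFT)."""
    n = len(frame)
    return [
        abs(sum(frame[i] * cmath.exp(-2j * math.pi * k * i / n)
                for i in range(n)))
        for k in range(n // 2 + 1)
    ]
```

One second of audio at the default settings produces 98 frames of 400 samples each, and the spectra are then fed to the model in frame order.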
在S1303中,基于帧编号,依次将各个所述音频帧对应的所述语音频谱导入所述实时语音识别模型,输出所述文本信息。
在本实施例中,终端设备可以根据各个音频帧在目标语言信号内关联的帧编号,依次将各个音频帧转换得到的语音频谱导入到实时语音识别模型,该实时语音识别模型可以输出各个音频帧对应的发音概率,并基于各个候选发音概率以及上下文关联度,生成对应的文本信息。
在本申请实施例中,通过对目标语音信号进行预处理,得到目标语音信号内各个音频帧的语音频谱,从而能够提高实时语音识别模型的数据处理效率,提高了识别效率。
图14示出了本申请第七实施例提供的一种语音识别的方法的具体实现流程图。参见图14,相对于图3、图6、图8、图10以及图11任一所述实施例,本实施例提供的一种语音识别的方法在S303之后,还包括:S1401,具体详述如下:
进一步地,在所述将所述目标语言信号输入至与所述目标语言类型对应的语音识别模型,获得所述语音识别模型输出的文本信息之后,还包括:
在S1401中,将所述目标语音信号导入所述目标语言类型对应的训练集。
在本实施例中,终端设备在输出目标语音信号对应的文本信息后,可以将目标语音信号以及对应的文本信息导入到训练集中,从而实现了训练集的自动扩充。
在本申请实施例中,通过自动标记目标语言信号的目标语言类型的方式,增加训练集中的样本个数,实现了自动扩充样本集的目的,提高训练操作的准确性。
应理解,上述实施例中各步骤的序号的大小并不意味着执行顺序的先后,各过程的执行顺序应以其功能和内在逻辑确定,而不应对本申请实施例的实施过程构成任何限定。
对应于上文实施例所述的语音识别的方法,图15示出了本申请实施例提供的语音识别的装置的结构框图,为了便于说明,仅示出了与本申请实施例相关的部分。
参照图15,该语音识别的装置包括:
目标语音信号获取单元151,用于获取待识别的目标语音信号;
目标语言类型识别单元152,用于确定所述目标语音信号的目标语言类型;
a speech recognition unit 153, configured to input the target speech signal into the speech recognition model corresponding to the target language type, and obtain the text information output by the speech recognition model;
其中,所述语音识别模型是通过训练样本集训练得到的,所述训练样本集包括多个扩展语音信号、每个扩展语音信号对应的扩展文本信息、每个扩展语音信号对应的原始语音信号以及每个原始语音信号对应的原始文本信息,所述扩展语音信号是基于基础语言类型的已有文本转换得到的。
可选地,所述语音识别的装置还包括:
已有文本获取单元,用于获取所述基础语言类型对应的所述已有文本;
扩展语音文本转换单元,用于将所述已有文本转换成所述目标语言类型对应的扩展语音文本;
扩展语音信号生成单元,用于基于语音合成算法,生成所述扩展语音文本对应的所述扩展语音信号。
可选地,所述语音识别的装置还包括:
异步语音识别模型配置单元,用于通过所述训练集中的所述原始语音信号以及与所述原始语音信号对应的原始语言文本,对第一原生语音模型进行训练,得到异步语 音识别模型;
发音概率矩阵输出单元,用于基于所述异步语音识别模型,输出所述扩展语音信号对应的发音概率矩阵;
实时语音识别模型配置单元,用于根据所述发音概率矩阵以及所述扩展语音信号,对第二原生语音模型进行训练,得到所述实时语音识别模型。
可选地,所述实时语音识别模型配置单元包括:
准实时语音模型生成单元,用于根据发音概率矩阵以及所述扩展语音信号,对所述第二原生语音模型进行粗粒度训练,得到准实时语音模型;
实时语音识别模型生成单元,用于根据所述原始语音信号以及所述原始语言文本,对所述准实时语音模型进行细粒度训练,得到所述实时语音识别模型。
可选地,所述准实时语音模型生成单元包括:
预测概率矩阵生成单元,用于将所述扩展语音信号导入所述第二原生语音模型,确定所述扩展语音信号对应的预测概率矩阵;
损失量计算单元,用于所述发音概率矩阵以及所述预测概率矩阵导入预设的损失函数,计算所述第二原生语音模型的损失量;
准实时语音识别模型训练单元,用于基于所述损失量调整所述第二原生语音模型内的网络参量,得到所述准实时语音识别模型。
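Optionally, the loss function is specifically:
$$\mathrm{Loss}_{top\_k} = -\sum_{t=1}^{T}\sum_{c=1}^{C} \hat{y}_c^t \,\ln p_c^t, \qquad \hat{y}_c^t = \begin{cases} y_c^t, & r_c^t \le K \\ 0, & r_c^t > K \end{cases}$$
where $\mathrm{Loss}_{top\_k}$ is the loss; $p_c^t$ is the probability value of the c-th pronunciation in the t-th frame of the extended speech signal in the predicted probability matrix; $\hat{y}_c^t$ is the probability value of the c-th pronunciation in the t-th frame of the extended speech signal in the pronunciation probability matrix after processing by the optimization algorithm; T is the total number of frames; C is the total number of pronunciations recognized in the t-th frame; $y_c^t$ is the probability value of the c-th pronunciation in the t-th frame of the extended speech signal in the pronunciation probability matrix; $r_c^t$ is the rank of the c-th pronunciation after all pronunciations of the t-th frame of the extended speech signal in the pronunciation probability matrix are sorted by probability value in descending order; and K is a preset parameter.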
可选地,所述异步语音识别模型内的第一网络层级多于所述实时语音识别模型内的第二网络层级。
可选地,所述语音识别单元153包括:
将所述目标语音信号划分为多个音频帧;
分别对各个所述音频帧进行离散傅里叶变换,得到各个所述音频帧对应的语音频谱;
基于帧编号,依次将各个所述音频帧对应的所述语音频谱导入所述实时语音识别模型,输出所述文本信息。
可选地,所述语音识别的装置还包括:
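a training set expansion unit, configured to import the target speech signal into the training set corresponding to the target language type.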
因此,本申请实施例提供的语音识别的装置同样可以通过样本数量较大的基础语 言文本转换为扩展语音信号,并通过目标语言类型对应的原始语音信号以及扩展语音信号对目标语言类型对应的实时语音识别模型进行训练,并通过训练后的实时语音识别模型对目标语音信号进行语音识别,输出文本信息,从而能够增加训练非基础语言的实时语音识别模型训练所需的样本个数,从而提高了语音识别的准确性以及适用性。
图16为本申请一实施例提供的终端设备的结构示意图。如图16所示,该实施例的终端设备16包括:至少一个处理器160(图16中仅示出一个)处理器、存储器161以及存储在所述存储器161中并可在所述至少一个处理器160上运行的计算机程序162,所述处理器160执行所述计算机程序162时实现上述任意各个语音识别的方法实施例中的步骤。
所述终端设备16可以是桌上型计算机、笔记本、掌上电脑及云端服务器等计算设备。该终端设备可包括,但不仅限于,处理器160、存储器161。本领域技术人员可以理解,图16仅仅是终端设备16的举例,并不构成对终端设备16的限定,可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件,例如还可以包括输入输出设备、网络接入设备等。
所称处理器160可以是中央处理单元(Central Processing Unit,CPU),该处理器160还可以是其他通用处理器、数字信号处理器(Digital Signal Processor,DSP)、专用集成电路(Application Specific Integrated Circuit,ASIC)、现成可编程门阵列(Field-Programmable Gate Array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。
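In some embodiments, the memory 161 may be an internal storage unit of the terminal device 16, such as a hard disk or memory of the terminal device 16. In other embodiments, the memory 161 may be an external storage device of the terminal device 16, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card equipped on the terminal device 16. Further, the memory 161 may include both an internal storage unit and an external storage device of the terminal device 16. The memory 161 is used to store the operating system, application programs, the boot loader (BootLoader), data, and other programs, such as the program code of the computer program. The memory 161 may also be used to temporarily store data that has been output or is to be output.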
需要说明的是,上述装置/单元之间的信息交互、执行过程等内容,由于与本申请方法实施例基于同一构思,其具体功能及带来的技术效果,具体可参见方法实施例部分,此处不再赘述。
所属领域的技术人员可以清楚地了解到,为了描述的方便和简洁,仅以上述各功能单元、模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能单元、模块完成,即将所述装置的内部结构划分成不同的功能单元或模块,以完成以上描述的全部或者部分功能。实施例中的各功能单元、模块可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中,上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。另外,各功能单元、模块的具体名称也只是为了便于相互区分,并不用于限制本申请的保护范围。上述系统中单元、模块的具体工作过程,可以参考前述方法 实施例中的对应过程,在此不再赘述。
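An embodiment of this application further provides a network device. The network device includes at least one processor, a memory, and a computer program that is stored in the memory and can run on the at least one processor. When executing the computer program, the processor implements the steps in any of the foregoing method embodiments.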
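An embodiment of this application further provides a computer-readable storage medium. The computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the steps in the foregoing method embodiments can be implemented.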
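An embodiment of this application provides a computer program product. When the computer program product runs on a mobile terminal, the mobile terminal is enabled to implement the steps in the foregoing method embodiments.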
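When the integrated unit is implemented as a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, all or some of the processes in the foregoing method embodiments of this application may be completed by a computer program instructing related hardware. The computer program may be stored in a computer-readable storage medium, and when executed by a processor, it can implement the steps of the foregoing method embodiments. The computer program includes computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer-readable medium may include at least: any entity or apparatus capable of carrying the computer program code to a photographing apparatus/terminal device, a recording medium, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, and a software distribution medium, for example a USB flash drive, a removable hard disk, a magnetic disk, or an optical disc. In some jurisdictions, according to legislation and patent practice, the computer-readable medium may not be an electrical carrier signal or a telecommunications signal.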
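In the foregoing embodiments, the description of each embodiment has its own emphasis. For a part that is not detailed or recorded in one embodiment, refer to the related descriptions in other embodiments.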
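A person of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described with reference to the embodiments disclosed herein can be implemented by electronic hardware, or by a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the particular application and the design constraints of the technical solution. A skilled person may use different methods to implement the described functions for each particular application, but such an implementation should not be considered beyond the scope of this application.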
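In the embodiments provided in this application, it should be understood that the disclosed apparatus/network device and method may be implemented in other ways. For example, the apparatus/network device embodiments described above are merely illustrative. The division into modules or units is merely a division by logical function, and other divisions are possible in an actual implementation: a plurality of units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.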
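The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the objective of the solution in this embodiment.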
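The foregoing embodiments are intended only to describe the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of this application, and shall all fall within the protection scope of this application.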
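Finally, it should be noted that the foregoing is merely specific implementations of this application, but the protection scope of this application is not limited thereto. Any variation or replacement within the technical scope disclosed in this application shall fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.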
Claims (12)
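- A speech recognition method, comprising: obtaining a to-be-recognized target speech signal; determining a target language type of the target speech signal; and inputting the target speech signal into a speech recognition model corresponding to the target language type, to obtain text information output by the speech recognition model; wherein the speech recognition model is obtained by training on a training sample set, the training sample set includes a plurality of extended speech signals, extended text information corresponding to each extended speech signal, an original speech signal corresponding to each extended speech signal, and original text information corresponding to each original speech signal, and the extended speech signals are obtained by converting existing text of a base language type.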
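- The method according to claim 1, wherein before the inputting of the target speech signal into the speech recognition model corresponding to the target language type to obtain the text information output by the speech recognition model, the method further comprises: obtaining the existing text corresponding to the base language type; converting the existing text into extended speech text corresponding to the target language type; and generating the extended speech signal corresponding to the extended speech text.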
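- The method according to claim 1, wherein before the inputting of the target speech signal into the speech recognition model corresponding to the target language type to obtain the text information output by the speech recognition model, the method further comprises: training a first native speech model by using the original speech signals in the training set and the original language text corresponding to the original speech signals, to obtain an asynchronous speech recognition model; outputting, based on the asynchronous speech recognition model, a pronunciation probability matrix corresponding to the extended speech signal; and training a second native speech model based on the pronunciation probability matrix and the extended speech signal, to obtain the real-time speech recognition model.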
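- The method according to claim 3, wherein the training of the second native speech model based on the pronunciation probability matrix and the extended speech signal to obtain the real-time speech recognition model comprises: performing coarse-grained training on the second native speech model based on the pronunciation probability matrix and the extended speech signal, to obtain a quasi-real-time speech model; and performing fine-grained training on the quasi-real-time speech model based on the original speech signal and the original language text, to obtain the real-time speech recognition model.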
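- The method according to claim 4, wherein the performing of coarse-grained training on the second native speech model based on the pronunciation probability matrix and the extended speech signal to obtain the quasi-real-time speech model comprises: importing the extended speech signal into the second native speech model, and determining a prediction probability matrix corresponding to the extended speech signal; importing the pronunciation probability matrix and the prediction probability matrix into a preset loss function, and calculating a loss value of the second native speech model; and adjusting network parameters in the second native speech model based on the loss value, to obtain the quasi-real-time speech recognition model.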
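- The method according to claim 3, wherein the asynchronous speech recognition model contains more first network layers than the real-time speech recognition model contains second network layers.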
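- The method according to any one of claims 1 to 7, wherein the inputting of the target speech signal into the speech recognition model corresponding to the target language type to obtain the text information output by the speech recognition model comprises: dividing the target speech signal into a plurality of audio frames; performing a discrete Fourier transform on each audio frame to obtain the speech spectrum corresponding to that audio frame; and importing, based on frame numbers, the speech spectra corresponding to the audio frames into the real-time speech recognition model in sequence, to output the text information.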
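- The method according to any one of claims 1 to 7, wherein after the inputting of the target speech signal into the speech recognition model corresponding to the target language type to obtain the text information output by the speech recognition model, the method further comprises: importing the target speech signal into the training set corresponding to the target language type.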
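- A speech recognition apparatus, comprising: a target speech signal obtaining unit, configured to obtain a to-be-recognized target speech signal; a target language type recognition unit, configured to determine a target language type of the target speech signal; and a speech recognition unit, configured to input the target speech signal into a speech recognition model corresponding to the target language type, to obtain text information output by the speech recognition model; wherein the speech recognition model is obtained by training on a training sample set, the training sample set includes a plurality of extended speech signals, extended text information corresponding to each extended speech signal, an original speech signal corresponding to each extended speech signal, and original text information corresponding to each original speech signal, and the extended speech signals are obtained by converting existing text of a base language type.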
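- A terminal device, comprising a memory, a processor, and a computer program that is stored in the memory and can run on the processor, wherein when executing the computer program, the processor implements the method according to any one of claims 1 to 9.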
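- A computer-readable storage medium, storing a computer program, wherein when the computer program is executed by a processor, the method according to any one of claims 1 to 9 is implemented.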
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| EP20910138.5A EP4064276A4 (en) | 2019-12-31 | 2020-10-30 | METHOD AND DEVICE FOR VOICE RECOGNITION, TERMINAL AND STORAGE MEDIA |
| US17/789,880 US20230072352A1 (en) | 2019-12-31 | 2020-10-30 | Speech Recognition Method and Apparatus, Terminal, and Storage Medium |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201911409041.5A CN111261144B (zh) | 2019-12-31 | 2019-12-31 | Speech recognition method and apparatus, terminal, and storage medium |
| CN201911409041.5 | 2019-12-31 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2021135611A1 (zh) | 2021-07-08 |
Family
ID=70955226
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2020/125608 Ceased WO2021135611A1 (zh) | Speech recognition method and apparatus, terminal, and storage medium | 2019-12-31 | 2020-10-30 |
Country Status (4)
| Country | Link |
|---|---|
| US (1) | US20230072352A1 (zh) |
| EP (1) | EP4064276A4 (zh) |
| CN (1) | CN111261144B (zh) |
| WO (1) | WO2021135611A1 (zh) |
2019
- 2019-12-31 CN CN201911409041.5A patent/CN111261144B/zh not_active Expired - Fee Related
2020
- 2020-10-30 EP EP20910138.5A patent/EP4064276A4/en not_active Withdrawn
- 2020-10-30 US US17/789,880 patent/US20230072352A1/en not_active Abandoned
- 2020-10-30 WO PCT/CN2020/125608 patent/WO2021135611A1/zh not_active Ceased
Also Published As
| Publication number | Publication date |
|---|---|
| EP4064276A1 (en) | 2022-09-28 |
| EP4064276A4 (en) | 2023-05-10 |
| US20230072352A1 (en) | 2023-03-09 |
| CN111261144A (zh) | 2020-06-09 |
| CN111261144B (zh) | 2023-03-03 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 20910138; Country of ref document: EP; Kind code of ref document: A1 |
| | ENP | Entry into the national phase | Ref document number: 2020910138; Country of ref document: EP; Effective date: 20220621 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | WWW | Wipo information: withdrawn in national office | Ref document number: 2020910138; Country of ref document: EP |



