WO2022227973A1 - Method and system for building a speech recognition model and for speech processing - Google Patents
- Publication number
- WO2022227973A1 (application PCT/CN2022/083190)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- target
- keyword
- group
- decoding
- phoneme sequence
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/1815—Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/183—Speech classification or search using natural language modelling using context dependencies, e.g. language models
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/247—Thesauruses; Synonyms
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/225—Feedback of the input speech
Definitions
- Embodiments of the present disclosure generally relate to the field of computers, and more particularly, to methods and systems for building speech recognition models and speech processing.
- semantic understanding is an important technology for realizing speech interaction.
- for devices with limited computing power in particular, semantic understanding is the key to implementing voice control.
- Embodiments of the present disclosure provide a solution for building a speech recognition model and speech processing.
- a method of building a speech recognition model includes: acquiring a target keyword; acquiring a synonym group semantically associated with the target keyword; training a language model using the target keyword and the synonym group to obtain a target language model; generating a first decoding graph according to the target language model, the first decoding graph indicating a plurality of decoding paths that satisfy grammatical constraint rules determined based on the target keyword and the synonym group; and determining a speech recognition model based on the first decoding graph.
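The idea of a decoding graph whose paths are constrained by the keywords and their synonyms can be illustrated with a toy sketch. This is not the disclosed implementation: the language-model training step is skipped, the graph is a plain prefix tree over word tokens rather than a compiled weighted transducer, and the names `Node`, `build_decoding_graph`, `decode`, and the example phrases are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """One state of the toy decoding graph."""
    children: Dict[str, "Node"] = field(default_factory=dict)  # token -> next state
    keyword: Optional[str] = None  # set only on final states


def build_decoding_graph(synonym_groups: Dict[str, List[str]]) -> Node:
    """Build a prefix tree whose root-to-final paths are the permitted phrases."""
    root = Node()
    for keyword, synonyms in synonym_groups.items():
        for phrase in [keyword, *synonyms]:
            node = root
            for token in phrase.split():
                node = node.children.setdefault(token, Node())
            node.keyword = keyword  # every synonym resolves to its target keyword
    return root


def decode(graph: Node, tokens: List[str]) -> Optional[str]:
    """Follow the tokens through the graph; return the keyword, or None."""
    node = graph
    for token in tokens:
        node = node.children.get(token)
        if node is None:
            return None  # the token sequence leaves the permitted paths
    return node.keyword  # None if the sequence stops before a final state


# Hypothetical keywords and synonym groups, for illustration only.
groups = {
    "raise the volume": ["turn up the volume", "increase the volume"],
    "lower the volume": ["turn down the volume"],
}
graph = build_decoding_graph(groups)
print(decode(graph, "turn up the volume".split()))
```

Because only phrases in the keyword and synonym groups form complete paths, any other token sequence is rejected, which is the sense in which the graph encodes a grammar constraint in this sketch.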
- the method may, for example, be performed by a first computing device with relatively strong computing capabilities.
- the first computing device may include, for example, a cloud-side or an embedded heavy device, which may have strong computing capabilities for performing the construction of the speech recognition model.
- the first computing device may also include, for example, a user terminal device.
- one or more steps in the method may also be performed collaboratively by the user terminal and the cloud.
- the obtained speech recognition model may, for example, be deployed to a second computing device.
- the second computing device may comprise an embedded light device, e.g., one having less computing power, for performing speech processing with the deployed speech recognition model.
- Examples of the second computing device may include, but are not limited to, smart home devices (e.g., air conditioners, refrigerators, washing machines, TVs, speakers, etc.), smart wearable devices (e.g., bracelets, watches, glasses, etc.), or in-vehicle devices.
- the embodiments of the present disclosure can construct a speech recognition model with keyword generalization recognition capability, so that, for example, a second computing device with less computing power can have keyword generalization recognition capability, thereby improving the user's interactive experience.
- the target keywords include keywords of speech input from an audio collector located at the user terminal.
- the target keyword includes a keyword input from a text collector, and the text collector is located at the user terminal.
- the user may, for example, directly provide voice input or text input to the user terminal, so that the user terminal can extract keywords from the voice input or text input, so as to construct a voice recognition model.
- the first computing device is a computing device different from the user terminal, for example, a cloud device or an edge computing device.
- the user can input voice or text using an interface provided by the user terminal, for example.
- Such speech input or text input may be sent to the first computing device to enable the first computing device to obtain keywords for the construction of the speech recognition model.
- the user can customize the keywords that the speech recognition model can support, so that the degree of personalization of the speech recognition model can be improved.
- determining the synonym group semantically associated with the target keyword comprises: determining the semantics of the target keyword; and determining the synonym group based on at least the semantics of the target keyword, wherein, for each synonym in the synonym group, the difference between the semantics of the synonym and the semantics of the target keyword is less than a difference threshold.
- the first computing device can automatically expand the associated synonym groups based on semantics without depending on the user's input, thereby reducing the user's interaction overhead.
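A semantic difference threshold of this kind could be sketched as follows. The text does not say how semantics are represented or compared, so the use of embedding vectors and cosine distance here is an assumption, and `cosine_distance`, `select_synonyms`, and the toy vectors are hypothetical.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def select_synonyms(keyword, candidates, embeddings, difference_threshold):
    """Keep candidates whose semantic distance to the keyword is below the threshold."""
    target = embeddings[keyword]
    return [
        candidate
        for candidate in candidates
        if cosine_distance(target, embeddings[candidate]) < difference_threshold
    ]


# Toy two-dimensional "embeddings" (assumed values, for illustration only).
toy_embeddings = {
    "raise the volume": [1.0, 0.0],
    "turn up the volume": [0.9, 0.1],
    "open the window": [0.0, 1.0],
}
synonyms = select_synonyms(
    "raise the volume",
    ["turn up the volume", "open the window"],
    toy_embeddings,
    difference_threshold=0.2,
)
print(synonyms)
```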
- determining the synonym group based on at least the semantics of the target keyword includes: determining the synonym group based on the semantics of the target keyword and the length of the target keyword, wherein the difference between the length of each synonym in the synonym group and the length of the target keyword is less than a length threshold.
- the length of a keyword may represent, for example, the number of characters or the number of words included in the keyword. Based on this method, the synonyms in the synonym group can be made to have similar lengths, so that the complexity of decoding and searching using the decoding map can be reduced.
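The length constraint can be sketched as a simple filter. The helper names, the `unit` switch between word count and character count, and the example phrases are assumptions made for illustration.

```python
def keyword_length(phrase, unit="words"):
    """Length of a phrase in words (default) or in characters."""
    return len(phrase.split()) if unit == "words" else len(phrase)


def filter_by_length(keyword, candidates, length_threshold, unit="words"):
    """Keep candidates whose length differs from the keyword's by less than the threshold."""
    target = keyword_length(keyword, unit)
    return [
        candidate
        for candidate in candidates
        if abs(keyword_length(candidate, unit) - target) < length_threshold
    ]


kept = filter_by_length(
    "raise the volume",
    ["turn up the volume", "make it a good deal louder in here please"],
    length_threshold=2,
)
print(kept)
```

Keeping the synonyms close in length to the keyword keeps the paths in the graph similar in depth, which is one plausible reading of why the search becomes cheaper.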
- determining the synonym group based at least on the semantics of the target keyword includes: obtaining a plurality of candidate synonyms based on the semantics of the target keyword; providing the plurality of candidate synonyms to a user; and determining the synonym group from the plurality of candidate synonyms based on a user input received from the user, the user input indicating that at least one candidate synonym of the plurality of candidate synonyms is excluded or confirmed.
- the synonym group used for training the speech recognition model can be further adjusted based on the user feedback, which can make the obtained speech recognition model more in line with the user's usage habits.
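One way the user's confirmation or exclusion might be applied to the candidate list is sketched below. The rule used here (if anything is confirmed, keep only the confirmed candidates; otherwise drop the excluded ones) is an assumption, as is the function name `apply_user_feedback`.

```python
def apply_user_feedback(candidates, confirmed=None, excluded=None):
    """Derive the synonym group from candidate synonyms and the user's input."""
    confirmed = set(confirmed or ())
    excluded = set(excluded or ())
    if confirmed:
        # The user explicitly confirmed some candidates: keep only those.
        return [c for c in candidates if c in confirmed and c not in excluded]
    # Otherwise keep every candidate the user did not exclude.
    return [c for c in candidates if c not in excluded]


candidates = ["turn up the volume", "increase the volume", "louder"]
group_after_exclusion = apply_user_feedback(candidates, excluded=["louder"])
group_after_confirmation = apply_user_feedback(candidates, confirmed=["louder"])
print(group_after_exclusion, group_after_confirmation)
```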
- the target keyword includes at least a first keyword and a second keyword, and determining the speech recognition model based on the first decoding graph includes: obtaining a first group of decoding paths and a second group of decoding paths from the first decoding graph, the first group of decoding paths including decoding paths corresponding to the first keyword and to a first synonym group semantically associated with the first keyword, and the second group of decoding paths including decoding paths corresponding to the second keyword and to a second synonym group semantically associated with the second keyword; generating a first subgraph based on the first group of decoding paths; generating a second subgraph based on the second group of decoding paths; and determining the speech recognition model based on at least the first subgraph and the second subgraph.
- the generated decoding map has lower complexity and can support faster decoding search, thereby reducing computational overhead and storage overhead.
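The grouping of decoding paths into one subgraph per keyword can be sketched as follows. Here a decoding path is simplified to a phrase, a subgraph to a list of token tuples, and paths that belong to no keyword group are dropped; all of these are assumptions, and `split_into_subgraphs` is a hypothetical name.

```python
from collections import defaultdict


def split_into_subgraphs(decoding_paths, synonym_groups):
    """Group decoding paths by the keyword whose synonym group they belong to."""
    phrase_to_keyword = {}
    for keyword, synonyms in synonym_groups.items():
        phrase_to_keyword[keyword] = keyword
        for synonym in synonyms:
            phrase_to_keyword[synonym] = keyword

    subgraphs = defaultdict(list)
    for path in decoding_paths:
        keyword = phrase_to_keyword.get(path)
        if keyword is not None:  # paths outside every group are discarded
            subgraphs[keyword].append(tuple(path.split()))
    return dict(subgraphs)


groups = {
    "raise the volume": ["turn up the volume"],
    "lower the volume": ["turn down the volume"],
}
paths = ["raise the volume", "turn up the volume", "lower the volume", "play some music"]
subgraphs = split_into_subgraphs(paths, groups)
print(subgraphs)
```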
- the first subgraph indicates a first decoding path and second decoding paths, where the first decoding path is the decoding path corresponding to the first keyword and each second decoding path is a decoding path corresponding to a synonym in the first synonym group.
- acquiring the target keywords includes: acquiring a first keyword group according to pre-stored historical keywords and received keywords; and in response to determining that the number of keywords in the first keyword group exceeds a predetermined threshold, obtaining the target keywords from the first keyword group based on the predetermined threshold.
- the first computing device may reserve only a predetermined threshold number of keywords in the first keyword group as target keywords.
- acquiring the target keyword from the first keyword group based on a predetermined threshold includes: acquiring the target keywords from the first keyword group according to attributes of the keywords in the first keyword group, the number of target keywords being the predetermined threshold. For example, one or more historical keywords created earliest may be deleted from the first keyword group to obtain a predetermined threshold number of keywords.
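The example of deleting the earliest-created keywords can be sketched as below. The creation time is used as the only attribute, keywords are passed as `(keyword, created_at)` pairs, and a keyword received again takes its newer timestamp; these details and the name `select_target_keywords` are assumptions.

```python
def select_target_keywords(history, received, max_keywords):
    """Merge stored and newly received keywords, dropping the oldest on overflow.

    history and received are lists of (keyword, created_at) pairs.
    """
    merged = {}
    for keyword, created_at in [*history, *received]:
        merged[keyword] = created_at  # a repeated keyword keeps the later timestamp
    ordered = sorted(merged, key=merged.get)  # oldest first
    overflow = max(0, len(ordered) - max_keywords)
    return ordered[overflow:]  # delete the earliest-created keywords


history = [("raise the volume", 1), ("lower the volume", 2)]
received = [("open the window", 3)]
targets = select_target_keywords(history, received, max_keywords=2)
print(targets)
```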
- acquiring target keywords from the first keyword group based on a predetermined threshold includes: acquiring target keywords from the first keyword group according to a user instruction, and the number of target keywords is a predetermined threshold. For example, which keywords in the first keyword group to retain as target keywords may be selected according to user input.
- the first computing device may further instruct to provide the speech recognition model to the target computing device (e.g., the second computing device) for deployment of the speech recognition model on the target computing device. Based on this approach, automatic deployment of speech recognition models can be supported.
- a method of speech processing includes: receiving a speech input; and utilizing a speech recognition model to determine a textual representation associated with the speech input, wherein the speech recognition model is obtained by: acquiring a target keyword; acquiring a synonym group semantically associated with the target keyword; training a language model using the target keyword and the synonym group to obtain a target language model; generating a first decoding graph according to the target language model, the first decoding graph indicating a plurality of decoding paths that satisfy grammatical constraint rules determined based on the target keyword and the synonym group; and determining the speech recognition model based on the first decoding graph.
- the speech recognition model may be obtained by the first computing device.
- the first computing device may include, for example, a cloud-side or an embedded heavy device, which may have strong computing capabilities for performing the construction of the speech recognition model.
- the first computing device may also include, for example, a user terminal device.
- the step of obtaining the speech recognition model may also be performed collaboratively by the user terminal and the cloud.
- the speech processing method may, for example, be performed by a second computing device.
- the second computing device may comprise an embedded light device, e.g., one having less computing power, for performing speech processing with the deployed speech recognition model.
- Examples of the second computing device may include, but are not limited to, smart home devices (e.g., air conditioners, refrigerators, washing machines, TVs, speakers, etc.), smart wearable devices (e.g., bracelets, watches, glasses, etc.), or in-vehicle devices.
- the embodiments of the present disclosure can enable, for example, a second computing device with less computing power to have a keyword generalization and recognition capability, thereby improving the user's voice interaction experience.
- the target keywords include keywords of speech input from an audio collector located at the user terminal.
- the target keyword includes a keyword input from a text collector, and the text collector is located at the user terminal.
- the user may, for example, directly provide voice input or text input to the user terminal, so that the user terminal can extract keywords from the voice input or text input, so as to construct a voice recognition model.
- the first computing device is a computing device different from the user terminal, such as a cloud device or an edge computing device.
- the user can input voice or text by using an interface provided by the user terminal, for example.
- Such speech input or text input may be sent to the first computing device to enable the first computing device to obtain keywords for the construction of the speech recognition model.
- the user can customize the keywords that the speech recognition model can support, so that the degree of personalization of the speech recognition model can be improved.
- determining a synonym group semantically associated with the target keyword comprises: determining the semantics of the target keyword; and determining a synonym group based on at least the semantics of the target keyword, wherein, for each synonym in the synonym group, the difference between the semantics of the synonym and the semantics of the target keyword is less than a difference threshold.
- the first computing device can automatically expand the associated synonym groups based on semantics without depending on the user's input, thereby reducing the user's interaction overhead.
- determining the synonym group based on at least the semantics of the target keyword includes: determining the synonym group based on the semantics of the target keyword and the length of the target keyword, wherein the difference between the length of each synonym in the synonym group and the length of the target keyword is less than a length threshold.
- the length of a keyword may represent, for example, the number of characters or the number of words included in the keyword. Based on this method, the synonyms in the synonym group can be made to have similar lengths, so that the complexity of decoding and searching using the decoding map can be reduced.
- determining the synonym group based on at least the semantics of the target keyword includes: obtaining a plurality of candidate synonyms based on the semantics of the target keyword; providing the plurality of candidate synonyms to the user; and determining the synonym group from the plurality of candidate synonyms based on a user input received from the user, the user input indicating that at least one candidate synonym of the plurality of candidate synonyms is excluded or confirmed.
- the synonym group used for training the speech recognition model can be further adjusted based on the user feedback, which can make the obtained speech recognition model more in line with the user's usage habits.
- the target keyword includes at least a first keyword and a second keyword, and determining the speech recognition model based on the first decoding graph includes: obtaining a first group of decoding paths and a second group of decoding paths from the first decoding graph, the first group of decoding paths including decoding paths corresponding to the first keyword and to a first synonym group semantically associated with the first keyword, and the second group of decoding paths including decoding paths corresponding to the second keyword and to a second synonym group semantically associated with the second keyword; generating a first subgraph based on the first group of decoding paths and generating a second subgraph based on the second group of decoding paths; and determining the speech recognition model based on at least the first subgraph and the second subgraph.
- the generated decoding map has lower complexity and can support faster decoding search, thereby reducing computational overhead and storage overhead.
- the first subgraph indicates a first decoding path and second decoding paths, where the first decoding path is the decoding path corresponding to the first keyword and each second decoding path is a decoding path corresponding to a synonym in the first synonym group, and the first decoding path and each second decoding path have the same weight in the first subgraph. Based on this method, faster decoding and searching for the expanded synonyms can be implemented, thereby reducing computational overhead and storage overhead.
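Equal path weights within a subgraph can be sketched as follows. Expressing the weight as a negative log-probability over a uniform distribution is an assumption borrowed from common weighted finite-state practice, not something the text specifies, and `uniform_path_weights` is a hypothetical helper.

```python
import math


def uniform_path_weights(keyword, synonyms):
    """Give the keyword path and every synonym path the same weight."""
    phrases = [keyword, *synonyms]
    weight = -math.log(1.0 / len(phrases))  # same cost for each path
    return {phrase: weight for phrase in phrases}


weights = uniform_path_weights(
    "raise the volume", ["turn up the volume", "increase the volume"]
)
print(weights)
```

With identical weights, no path inside the subgraph is preferred over another, so the search does not need to rank a keyword against its own synonyms.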
- acquiring the target keyword includes: acquiring a first keyword group according to pre-stored historical keywords and received keywords; and in response to determining that the number of keywords in the first keyword group exceeds a predetermined threshold, obtaining the target keyword from the first keyword group based on the predetermined threshold. For example, only a predetermined threshold number of keywords in the first keyword group may be reserved as target keywords.
- acquiring the target keyword from the first keyword group based on a predetermined threshold includes: acquiring the target keywords from the first keyword group according to attributes of the keywords in the first keyword group, the number of target keywords being the predetermined threshold. For example, one or more historical keywords created earliest may be deleted from the first keyword group to obtain a predetermined threshold number of keywords.
- acquiring target keywords from the first keyword group based on a predetermined threshold includes: acquiring target keywords from the first keyword group according to a user instruction, and the number of target keywords is a predetermined threshold. For example, which keywords in the first keyword group to retain as target keywords may be selected according to user input.
- the second computing device may also perform an action corresponding to the textual representation.
- the second computing device may also generate a corresponding control command based on the textual representation, and send it to the third computing device, so that the third computing device performs the corresponding action.
- the textual representation corresponds to the target keyword or a synonym in the group of synonyms.
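How a recognized textual representation might be mapped back to its target keyword and then to an action is sketched below. The command table, the command identifiers, and the name `resolve_command` are hypothetical; sending the command to a third computing device is left out.

```python
def resolve_command(text, synonym_groups, commands):
    """Map recognized text to the command registered for its target keyword."""
    for keyword, synonyms in synonym_groups.items():
        if text == keyword or text in synonyms:
            return commands.get(keyword)
    return None  # the text matches neither a keyword nor a synonym


groups = {
    "raise the volume": ["turn up the volume", "increase the volume"],
    "lower the volume": ["turn down the volume"],
}
commands = {
    "raise the volume": "VOLUME_UP",    # hypothetical command identifiers
    "lower the volume": "VOLUME_DOWN",
}
command = resolve_command("turn up the volume", groups, commands)
print(command)
```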
- a speech model building system includes: a keyword acquisition unit for acquiring target keywords; a synonym acquisition unit for acquiring synonym groups semantically associated with the target keywords; a model training unit for training a language model by using the target keywords and the synonym groups to obtain a target language model; a decoding graph generation unit for generating a first decoding graph according to the target language model, the first decoding graph indicating a plurality of decoding paths that satisfy grammatical constraint rules determined based on the target keyword and the synonym group; and a model determining unit configured to determine a speech recognition model based on the first decoding graph.
- the speech model building system may include, for example, a first computing device with relatively strong computing capabilities.
- the first computing device may include, for example, a cloud-side or an embedded heavy device, which may have strong computing capabilities for performing the construction of the speech recognition model.
- the first computing device may also include, for example, a user terminal device.
- the method may also be performed collaboratively by the user terminal and the cloud, for example.
- the obtained speech recognition model may, for example, be deployed to the second computing device.
- the second computing device may comprise an embedded light device, e.g., one having less computing power, for performing speech processing with the deployed speech recognition model.
- Examples of the second computing device may include, but are not limited to, smart home devices (e.g., air conditioners, refrigerators, washing machines, TVs, speakers, etc.), smart wearable devices (e.g., bracelets, watches, glasses, etc.), or in-vehicle devices.
- the embodiments of the present disclosure can construct a speech recognition model with keyword generalization recognition capability, so that, for example, a second computing device with less computing power can have keyword generalization recognition capability, thereby improving the user's interactive experience.
- the target keyword includes a keyword input from an audio collector, and the audio collector is located at the user terminal. In other embodiments of the third aspect, the target keyword includes a keyword input from a text collector, and the text collector is located at the user terminal.
- the user may, for example, directly provide voice input or text input to the user terminal, so that the user terminal can extract keywords from the voice input or text input, so as to construct a voice recognition model.
- the first computing device is a computing device different from the user terminal, such as a cloud device or an edge computing device.
- the user can input voice or text by using an interface provided by the user terminal, for example.
- Such speech input or text input may be sent to the first computing device to enable the first computing device to obtain keywords for the construction of the speech recognition model.
- the user can customize the keywords that the speech recognition model can support, so that the degree of personalization of the speech recognition model can be improved.
- the synonym obtaining unit is further configured to: determine the semantics of the target keyword; and determine a synonym group based on at least the semantics of the target keyword, wherein, for each synonym in the synonym group, the difference between the semantics of the synonym and the semantics of the target keyword is less than a difference threshold.
- the first computing device can automatically expand the associated synonym groups based on semantics without depending on the user's input, thereby reducing the user's interaction overhead.
- the synonym obtaining unit is further configured to: determine a synonym group based on the semantics of the target keyword and the length of the target keyword, wherein the difference between the length of each synonym in the synonym group and the length of the target keyword is less than a length threshold.
- the length of a keyword may represent, for example, the number of characters or the number of words included in the keyword. Based on this method, the synonyms in the synonym group can be made to have similar lengths, so that the complexity of decoding and searching using the decoding map can be reduced.
- the synonym obtaining unit is further configured to: obtain a plurality of candidate synonyms based on the semantics of the target keyword; provide the plurality of candidate synonyms to the user; and determine a synonym group from the plurality of candidate synonyms based on user input received from the user, the user input indicating that at least one candidate synonym of the plurality of candidate synonyms is excluded or confirmed.
- the synonym group used for training the speech recognition model can be further adjusted based on the user feedback, which can make the obtained speech recognition model more in line with the user's usage habits.
- the target keyword includes at least a first keyword and a second keyword, and the model determining unit is further configured to: obtain a first group of decoding paths and a second group of decoding paths from the first decoding graph, the first group of decoding paths including decoding paths corresponding to the first keyword and to a first synonym group semantically associated with the first keyword, and the second group of decoding paths including decoding paths corresponding to the second keyword and to a second synonym group semantically associated with the second keyword; generate a first subgraph based on the first group of decoding paths and generate a second subgraph based on the second group of decoding paths; and determine the speech recognition model based on at least the first subgraph and the second subgraph.
- the generated decoding map has lower complexity and can support faster decoding search, thereby reducing computational overhead and storage overhead.
- the first subgraph indicates a first decoding path and second decoding paths, where the first decoding path is the decoding path corresponding to the first keyword and each second decoding path is a decoding path corresponding to a synonym in the first synonym group, and the first decoding path and each second decoding path have the same weight in the first subgraph. Based on this method, faster decoding and searching for the expanded synonyms can be implemented, thereby reducing computational overhead and storage overhead.
- the keyword obtaining unit is further configured to: obtain a first keyword group according to pre-stored historical keywords and received keywords; and in response to determining that the number of keywords in the first keyword group exceeds a predetermined threshold, obtain the target keyword from the first keyword group based on the predetermined threshold.
- the keyword obtaining unit is further configured to: obtain target keywords from the first keyword group according to attributes of the keywords in the first keyword group, where the number of target keywords is the predetermined threshold. For example, one or more historical keywords created earliest may be deleted from the first keyword group to obtain a predetermined threshold number of keywords.
- the keyword acquisition unit is further configured to acquire target keywords from the first keyword group according to a user instruction, where the number of target keywords is a predetermined threshold. For example, which keywords in the first keyword group to retain as target keywords may be selected according to user input.
- the speech model building system may further instruct to provide the speech recognition model to the target computing device (e.g., the second computing device) for deployment of the speech recognition model on the target computing device. Based on this approach, automatic deployment of speech recognition models can be supported.
- a speech processing system includes: a speech input unit for receiving speech input; and a speech processing unit for determining a textual representation associated with the speech input using a speech recognition model, wherein the speech recognition model is obtained by: acquiring a target keyword; acquiring a synonym group semantically associated with the target keyword; training a language model using the target keyword and the synonym group to obtain a target language model; generating a first decoding graph according to the target language model, the first decoding graph indicating a plurality of decoding paths that satisfy grammatical constraint rules determined based on the target keyword and the synonym group; and determining the speech recognition model based on the first decoding graph.
- the speech recognition model may be obtained by the first computing device.
- the first computing device may include, for example, a cloud-side or an embedded heavy device, which may have strong computing capabilities for performing the construction of the speech recognition model.
- the first computing device may also include, for example, a user terminal device.
- the step of obtaining the speech recognition model may also be performed collaboratively by the user terminal and the cloud.
- the speech processing system may, for example, comprise a second computing device.
- the second computing device may comprise an embedded light device, e.g., one having less computing power, for performing speech processing with the deployed speech recognition model.
- Examples of the second computing device may include, but are not limited to, smart home devices (e.g., air conditioners, refrigerators, washing machines, TVs, speakers, etc.), smart wearable devices (e.g., bracelets, watches, glasses, etc.), or in-vehicle devices.
- the embodiments of the present disclosure can enable, for example, a second computing device with less computing power to have a keyword generalization and recognition capability, thereby improving the user's voice interaction experience.
- the target keywords include keywords of speech input from an audio collector located at the user terminal.
- the target keyword includes a keyword input from a text collector, and the text collector is located at the user terminal.
- the user may, for example, directly provide voice input or text input to the user terminal, so that the user terminal can extract keywords from the voice input or text input, so as to construct a voice recognition model.
- the first computing device is a computing device different from the user terminal, such as a cloud device or an edge computing device.
- the user can input voice or text by using an interface provided by the user terminal, for example.
- Such speech input or text input may be sent to the first computing device to enable the first computing device to obtain keywords for the construction of the speech recognition model.
- the user can customize the keywords that the speech recognition model can support, so that the degree of personalization of the speech recognition model can be improved.
- determining the synonym group semantically associated with the target keyword comprises: determining the semantics of the target keyword; and determining the synonym group based on at least the semantics of the target keyword, wherein, for each synonym in the synonym group, the difference between the semantics of the synonym and the semantics of the target keyword is less than a difference threshold.
- the first computing device can automatically expand the associated synonym groups based on semantics without depending on the user's input, thereby reducing the user's interaction overhead.
- determining the synonym group based on at least the semantics of the target keyword includes: determining the synonym group based on the semantics of the target keyword and the length of the target keyword, wherein the difference between the length of each synonym in the synonym group and the length of the target keyword is less than a length threshold.
- the length of a keyword may represent, for example, the number of characters or the number of words included in the keyword. Based on this method, the synonyms in the synonym group can be made to have similar lengths, so that the complexity of decoding and searching using the decoding map can be reduced.
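The two thresholds described above can be illustrated with a minimal sketch. It assumes that semantics are represented by embedding vectors and that the semantic difference is measured by cosine distance; neither choice is stated in the text. The names `select_synonyms`, `TOY_EMBEDDINGS`, and the threshold values are hypothetical, and keyword length is counted in words.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def select_synonyms(target, candidates, embed, diff_threshold, length_threshold):
    """Keep candidates whose semantic and length differences are both below threshold."""
    target_vec = embed[target]
    target_len = len(target.split())  # assumption: length = number of words
    group = []
    for cand in candidates:
        if cand == target:
            continue
        semantic_diff = cosine_distance(target_vec, embed[cand])
        length_diff = abs(len(cand.split()) - target_len)
        if semantic_diff < diff_threshold and length_diff < length_threshold:
            group.append(cand)
    return group

# Hypothetical toy embeddings, invented for illustration only.
TOY_EMBEDDINGS = {
    "raise volume": (0.9, 0.1),
    "turn up sound": (0.88, 0.15),
    "make it a lot louder please": (0.9, 0.12),
    "turn off": (0.1, 0.9),
}

synonym_group = select_synonyms(
    "raise volume", list(TOY_EMBEDDINGS), TOY_EMBEDDINGS,
    diff_threshold=0.05, length_threshold=2,
)
```

In this toy data, the long paraphrase is semantically close but is dropped by the length threshold, while "turn off" is dropped by the semantic threshold.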
- determining the synonym group based on at least the semantics of the target keyword includes: obtaining a plurality of candidate synonyms based on the semantics of the target keyword; providing the plurality of candidate synonyms to the user; and determining the synonym group from the plurality of candidate synonyms based on user input received from the user, the user input indicating that at least one candidate synonym of the plurality of candidate synonyms is excluded or confirmed.
- the synonym group used for training the speech recognition model can be further adjusted based on the user feedback, which can make the obtained speech recognition model more in line with the user's usage habits.
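A minimal sketch of the exclude-or-confirm step, under the assumption that the user input arrives as two optional collections of candidate strings. The function name `apply_user_feedback` and its parameters are hypothetical.

```python
def apply_user_feedback(candidates, excluded=(), confirmed=None):
    """Return the synonym group after applying the user's exclusions or confirmations.

    If `confirmed` is given, only confirmed candidates are kept;
    otherwise every candidate that was not excluded is kept.
    Candidate order is preserved.
    """
    excluded = set(excluded)
    if confirmed is not None:
        keep = set(confirmed) - excluded
        return [c for c in candidates if c in keep]
    return [c for c in candidates if c not in excluded]

candidates = ["turn up sound", "louder", "boost audio"]
after_exclusion = apply_user_feedback(candidates, excluded=["boost audio"])
after_confirmation = apply_user_feedback(candidates, confirmed=["louder"])
```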
- the target keyword includes at least a first keyword and a second keyword
- determining the speech recognition model based on the first decoding map includes: obtaining a first group of decoding paths and a second group of decoding paths from the first decoding map, the first group of decoding paths including decoding paths corresponding to the first keyword and to the first synonym group semantically associated with the first keyword, the second group of decoding paths including decoding paths corresponding to the second keyword and to the second synonym group semantically associated with the second keyword; generating a first subgraph based on the first group of decoding paths and generating a second subgraph based on the second group of decoding paths; and determining the speech recognition model based on at least the first subgraph and the second subgraph.
- the generated decoding map has lower complexity and can support faster decoding search, thereby reducing computational overhead and storage overhead.
- the first subgraph indicates a first decoding path and a second decoding path
- the first decoding path is a decoding path corresponding to the first keyword
- the second decoding path is a decoding path corresponding to a synonym in the first synonym group
- the first decoding path and each second decoding path have the same weight in the first subgraph. Based on this method, faster decoding and searching for the expanded synonyms can be implemented, thereby reducing computational overhead and storage overhead.
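The equal-weight subgraph can be sketched as follows. This is a simplified stand-in that represents each subgraph as a mapping from a decoding path (here just the phrase) to its weight, with a uniform weight assumed; the text only requires that the weights be the same. The function names are hypothetical.

```python
def build_subgraph(keyword, synonyms):
    """Give the keyword path and every synonym path the same weight."""
    paths = [keyword, *synonyms]
    weight = 1.0 / len(paths)  # assumption: uniform weights that sum to 1
    return {path: weight for path in paths}

def build_decoding_graph(keyword_to_synonyms):
    """Build one subgraph per target keyword."""
    return {kw: build_subgraph(kw, syns) for kw, syns in keyword_to_synonyms.items()}

graph = build_decoding_graph({
    "raise volume": ["turn up sound", "louder"],
    "turn off": ["shut down"],
})
```

Keeping one small subgraph per keyword means a search only has to consider the paths of the keyword groups, rather than every path of the full decoding map.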
- acquiring the target keywords includes: acquiring a first keyword group according to pre-stored historical keywords and received keywords; and in response to determining that the number of keywords in the first keyword group exceeds a predetermined threshold, The target keyword is obtained from the first keyword group based on a predetermined threshold.
- acquiring the target keywords from the first keyword group based on a predetermined threshold includes: acquiring the target keywords from the first keyword group according to attributes of the keywords, the number of target keywords being the predetermined threshold. For example, the one or more historical keywords created earliest may be deleted from the first keyword group to obtain the predetermined threshold number of keywords.
- acquiring target keywords from the first keyword group based on a predetermined threshold includes: acquiring target keywords from the first keyword group according to a user instruction, and the number of target keywords is a predetermined threshold. For example, which keywords in the first keyword group to retain as target keywords may be selected according to user input.
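The threshold-based selection can be sketched as below, using creation time as the keyword attribute, as in the example above. The `Keyword` record, its `created_at` field, and the function name are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Keyword:
    text: str
    created_at: int  # assumption: a monotonically increasing creation timestamp

def select_target_keywords(history, received, max_keywords):
    """Merge stored and newly received keywords, then drop the oldest beyond the cap."""
    first_group = list(history) + list(received)
    if len(first_group) <= max_keywords:
        return first_group
    newest_first = sorted(first_group, key=lambda k: k.created_at, reverse=True)
    kept = newest_first[:max_keywords]
    # Return the survivors in creation order.
    return sorted(kept, key=lambda k: k.created_at)

history = [Keyword("turn off", 1), Keyword("pause", 2)]
received = [Keyword("raise volume", 3)]
targets = select_target_keywords(history, received, max_keywords=2)
```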
- the speech processing system may also perform actions corresponding to the textual representation.
- the second computing device may also generate a corresponding control command based on the textual representation, and send it to the third computing device, so that the third computing device performs the corresponding action.
- the textual representation corresponds to the target keyword or a synonym in the group of synonyms.
- a method for building a speech recognition model includes: acquiring target language information; acquiring a synonymous phoneme sequence group associated with the target language information, where the synonymous phoneme sequence group includes at least one synonymous phoneme sequence, and the at least one synonymous phoneme sequence is semantically related to the target language information
- the method may, for example, be performed by a first computing device with relatively strong computing capabilities.
- the first computing device may include, for example, a cloud-side or an embedded heavy device, which may have strong computing capabilities for performing the construction of the speech recognition model.
- the first computing device may also include, for example, a user terminal device.
- one or more steps in the method may also be performed collaboratively by the user terminal and the cloud.
- the obtained speech recognition model may, for example, be deployed to a second computing device.
- the second computing device may comprise an embedded light device, eg, having less computing power, for performing speech processing with the deployed speech recognition model.
- Examples of the second computing device may include, but are not limited to, smart home devices (eg, air conditioners, refrigerators, washing machines, TVs, speakers, etc.), smart wearable devices (eg, bracelets, watches, glasses, etc.), or in-vehicle devices.
- the embodiments of the present disclosure can construct a speech recognition model with the ability to generalize the phoneme sequence associated with the target language information, so that, for example, a second computing device with less computing power can have phoneme sequence generalization and recognition capability, thereby improving the user interaction experience.
- the target language information may include speech or text.
- the target language information includes speech input from an audio collector located at the user terminal.
- the keywords of the text input are obtained from a text collector at the user terminal.
- the target language information may be some short instruction words or instruction sentences, such as "turn off", "stop", "pause", "increase the volume", and so on.
- the user may, for example, directly provide voice input or text input to the user terminal, so that the user terminal can extract target language information from the voice input or text input in order to construct the speech recognition model.
- the user can input voice or text by using an interface provided by the user terminal, for example.
- Such speech input or text input may be sent to the first computing device so that the first computing device can obtain target language information for the construction of the speech recognition model.
- the user can customize the phoneme sequence that the speech recognition model can support, so that the degree of personalization of the speech recognition model can be improved.
- obtaining a synonymous phoneme sequence group associated with the target language information includes: determining the semantics of the target language information; and determining the synonymous phoneme sequence group based on at least the semantics of the target language information, wherein the difference between the semantics of each synonymous phoneme sequence in the synonymous phoneme sequence group and the semantics of the target language information is less than a difference threshold.
- the first computing device can automatically expand the associated synonymous phoneme sequence group based on semantics without relying on the user's input, thereby reducing the user's interaction overhead.
- determining the synonymous phoneme sequence group based on at least the semantics of the target language information includes: determining a target phoneme sequence corresponding to the target language information; and determining the synonymous phoneme sequence group based on the semantics of the target phoneme sequence and the length of the target phoneme sequence, wherein the difference between the length of each synonymous phoneme sequence in the synonymous phoneme sequence group and the length of the target phoneme sequence is less than a length threshold.
- the length of a phoneme sequence may represent, for example, the number of phonemes (eg, initials and finals) included in the phoneme sequence.
- if the target language information is text, a phoneme sequence corresponding to the text can be obtained as the target phoneme sequence through a pronunciation dictionary.
- if the target language information is speech, the phoneme sequence of the speech can be obtained as the target phoneme sequence through an acoustic model.
- the synonymous phoneme sequences in the synonymous phoneme sequence group can be made to have similar lengths, thereby reducing the complexity of decoding and searching by using the decoding map.
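For the text case, the dictionary lookup and the phoneme-length filter can be sketched as follows. The lexicon below is a hypothetical toy pronunciation dictionary that splits romanized syllables into an initial and a final; it is not taken from the text, and the function names are illustrative only.

```python
# Hypothetical toy pronunciation dictionary: syllable -> [initial, final].
TOY_LEXICON = {
    "guan": ["g", "uan"],
    "bi": ["b", "i"],
    "diao": ["d", "iao"],
    "ting": ["t", "ing"],
}

def to_phonemes(text, lexicon):
    """Concatenate the dictionary pronunciations of each space-separated syllable."""
    phonemes = []
    for syllable in text.split():
        phonemes.extend(lexicon[syllable])
    return phonemes

def filter_by_phoneme_length(target_text, candidate_texts, lexicon, length_threshold):
    """Keep candidate phoneme sequences whose length is close to the target's."""
    target_len = len(to_phonemes(target_text, lexicon))
    group = []
    for cand in candidate_texts:
        seq = to_phonemes(cand, lexicon)
        if abs(len(seq) - target_len) < length_threshold:
            group.append(seq)
    return group

target_sequence = to_phonemes("guan bi", TOY_LEXICON)
synonymous_group = filter_by_phoneme_length(
    "guan bi", ["guan diao", "ting"], TOY_LEXICON, length_threshold=2
)
```

The candidates passed in are assumed to have already satisfied the semantic difference threshold; only the length condition is shown here.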
- obtaining a synonymous phoneme sequence group semantically associated with the target language information includes: obtaining a plurality of candidate synonyms based on the semantics of the target keyword corresponding to the target language information; providing the plurality of candidate synonyms to a user; determining a synonym group from the plurality of candidate synonyms based on user input received from the user, the user input indicating that at least one candidate synonym in the plurality of candidate synonyms is excluded or confirmed; and obtaining the synonymous phoneme sequence group based on a pronunciation dictionary and the synonym group.
- the synonymous phoneme sequence group used for training the speech recognition model can be further adjusted based on user feedback, which can make the obtained speech recognition model more suitable for the user's usage habits.
- obtaining a set of synonymous phoneme sequences semantically associated with the target phoneme sequence comprises: receiving speech input from a user; and generating a set of synonymous phoneme sequences based on the speech input. For example, the semantics of the speech input is obtained based on the keywords corresponding to the speech input, so as to generate the synonymous phoneme sequence group.
- the synonymous phoneme sequence group used for training the speech recognition model can be supplemented based on the user feedback in the form of speech input, which can make the obtained speech recognition model more suitable for the user's usage habits.
- the target language information includes at least first language information and second language information
- determining the speech recognition model based on the first decoding map includes: obtaining a first set of decoding paths and a second set of decoding paths from the first decoding map, the first set of decoding paths including the decoding paths of the first synonymous phoneme sequence group associated with the first language information, the second set of decoding paths including the decoding paths of the second synonymous phoneme sequence group associated with the second language information; generating a first subgraph based on the first set of decoding paths; generating a second subgraph based on the second set of decoding paths; and determining a speech recognition model based on at least the first subgraph and the second subgraph.
- the generated decoding map has lower complexity and can support faster decoding search, thereby reducing computational overhead and storage overhead.
- the first subgraph indicates a first decoding path and a second decoding path
- the first decoding path and the second decoding path are decoding paths in the first synonymous phoneme sequence group
- the first decoding path and the second decoding path have the same weight in the first subgraph.
- acquiring the target language information includes: acquiring a first language information group according to pre-stored historical language information and received language information; and, in response to determining that the number of pieces of language information in the first language information group exceeds a predetermined threshold, acquiring the target language information from the first language information group based on the predetermined threshold.
- acquiring the target language information from the first language information group based on the predetermined threshold includes: acquiring the target language information from the first language information group according to attributes of the language information, the quantity of the target language information being the predetermined threshold. For example, the one or more pieces of historical language information created earliest may be deleted from the first language information group, thereby obtaining the predetermined threshold number of pieces of language information.
- acquiring the target language information from the first language information group based on a predetermined threshold includes: acquiring the target language information from the first language information group according to a user instruction, the quantity of the target language information being the predetermined threshold. For example, which language information in the first language information group to retain as the target language information may be selected according to user input.
- the first computing device may further instruct to provide the speech recognition model to the target computing device (eg, the second computing device) for deployment of the speech recognition model on the target computing device. Based on this approach, automatic deployment of speech recognition models can be supported.
- a speech processing method comprising: receiving a speech instruction input; using a speech recognition model to obtain a phoneme sequence representation of the speech instruction input, the speech recognition model being configured to recognize the speech instruction input based on a phoneme sequence group whose phoneme sequences are semantically synonymous for an instruction; and, if the phoneme sequence representation corresponds to a phoneme sequence in the phoneme sequence group, executing the instruction corresponding to the phoneme sequence representation.
- the speech recognition model may be obtained by the first computing device.
- the first computing device may include, for example, a cloud-side or an embedded heavy device, which may have strong computing capabilities for performing the construction of the speech recognition model.
- the first computing device may also include, for example, a user terminal device.
- the step of obtaining the speech recognition model may also be performed collaboratively by the user terminal and the cloud.
- the speech processing method may, for example, be performed by a second computing device.
- the second computing device may comprise an embedded light device, eg, having less computing power, for performing speech processing with the deployed speech recognition model.
- Examples of the second computing device may include, but are not limited to, smart home devices (eg, air conditioners, refrigerators, washing machines, TVs, speakers, etc.), smart wearable devices (eg, wristbands, watches, glasses, etc.), or in-vehicle devices.
- the embodiments of the present disclosure can enable, for example, a second computing device with relatively small computing power to have the capability of generalizing and recognizing phoneme sequences, without the need for natural language understanding by recognizing keywords, thereby reducing the performance requirements of the device while also improving the user's voice interaction experience.
- the speech recognition model is obtained based on the following process: obtaining target language information; obtaining a synonymous phoneme sequence group associated with the target language information, the synonymous phoneme sequence group including at least one synonymous phoneme sequence, the synonymous phoneme sequence being the phoneme sequence corresponding to words and sentences semantically similar to the target language information; training a language model by using the synonymous phoneme sequence group to obtain a target language model; generating a first decoding map according to the target language model, the first decoding map indicating a plurality of decoding paths that satisfy the grammar constraint rules determined based on the synonymous phoneme sequence group; and determining the speech recognition model based on the first decoding map.
- the constructed speech recognition model can realize the generalized recognition ability of customized target language information.
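The idea that a decoding map admits only the paths allowed by the synonymous phoneme sequence group can be pictured with a much reduced sketch. It assumes a plain prefix tree in place of the trained language model and the decoding map, and it omits the weights that a real decoding map would carry. The names `build_decoding_trie`, `is_allowed_path`, and the `END` marker are hypothetical.

```python
END = "<end>"  # hypothetical marker for the end of a complete decoding path

def build_decoding_trie(phoneme_sequences):
    """Build a prefix tree whose root-to-END paths are exactly the given sequences."""
    root = {}
    for seq in phoneme_sequences:
        node = root
        for phoneme in seq:
            node = node.setdefault(phoneme, {})
        node[END] = {}
    return root

def is_allowed_path(trie, seq):
    """Return True only if `seq` is a complete path permitted by the trie."""
    node = trie
    for phoneme in seq:
        if phoneme not in node:
            return False
        node = node[phoneme]
    return END in node

trie = build_decoding_trie([
    ["g", "uan", "b", "i"],
    ["g", "uan", "d", "iao"],
])
```

Sequences that share a prefix, such as the two above, share nodes in the tree, which is one reason groups of similar length and content keep the search small.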
- if the phoneme sequence representation does not match any phoneme sequence in the phoneme sequence group, a notification of no recognition result is provided. Based on this method, the user's voice can be recognized in real time and efficiently, which improves the user's voice interaction experience.
- the speech recognition model is configured to recognize a first phoneme sequence group whose phoneme sequences share a first semantics and a second phoneme sequence group whose phoneme sequences share a second semantics.
- the method may further include executing a first instruction if the phoneme sequence representation corresponds to a first phoneme sequence in the first phoneme sequence group, and, if the phoneme sequence representation corresponds to a second phoneme sequence in the second phoneme sequence group, executing a second instruction whose action differs from that of the first instruction. In this way, the phoneme sequences in phoneme sequence groups with different semantics can be recognized by using the speech recognition model, so that the instruction corresponding to the user's intention can be executed.
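The matching and dispatch behavior described above can be sketched as a lookup over phoneme sequence groups. The group names, the toy phoneme sequences, the returned strings, and the function `dispatch` are all hypothetical.

```python
def dispatch(phoneme_repr, groups, actions):
    """Run the action of the group containing the recognized phoneme sequence.

    `groups` maps an instruction name to a set of phoneme tuples;
    `actions` maps the same names to zero-argument callables.
    """
    key = tuple(phoneme_repr)
    for name, sequences in groups.items():
        if key in sequences:
            return actions[name]()
    return "no recognition result"

GROUPS = {
    "power_off": {("g", "uan", "b", "i"), ("g", "uan", "d", "iao")},
    "stop": {("t", "ing", "zh", "i")},
}
ACTIONS = {
    "power_off": lambda: "powering off",
    "stop": lambda: "stopping playback",
}

result_first = dispatch(["g", "uan", "d", "iao"], GROUPS, ACTIONS)
result_second = dispatch(["t", "ing", "zh", "i"], GROUPS, ACTIONS)
result_none = dispatch(["k", "ai"], GROUPS, ACTIONS)
```

Because every sequence in a group maps to the same instruction, the device needs no separate natural language understanding step after recognition.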
- utilizing the speech recognition model to obtain the phoneme sequence representation of the speech instruction input may include: utilizing an acoustic model to generate emission probabilities from the speech features of the speech instruction input to phonemes; recognizing the speech instruction input by inputting the emission probabilities into the speech recognition model; and causing the speech recognition model to output the phoneme sequence representation. In this way, the corresponding phoneme sequence can be obtained from the instruction in the form of speech, so as to be matched against the phoneme sequences in the phoneme sequence group that can be recognized by the speech recognition model.
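One way to picture how emission probabilities select a decoding path is a Viterbi-style forced alignment over a small set of allowed paths. This is an assumed technique chosen for illustration; the text does not specify the search algorithm. The sketch assumes per-frame emission probabilities that are strictly positive, that each phoneme covers at least one frame, and that frames advance monotonically through the path. All names and numbers are hypothetical.

```python
import math

def alignment_log_score(emissions, path, index):
    """Best log-probability of aligning all frames to `path`, left to right."""
    n_frames, n_states = len(emissions), len(path)
    if n_frames < n_states:
        return -math.inf
    neg_inf = -math.inf
    prev = [neg_inf] * n_states
    prev[0] = math.log(emissions[0][index[path[0]]])
    for t in range(1, n_frames):
        cur = [neg_inf] * n_states
        for s in range(n_states):
            # Either stay on the same phoneme or advance from the previous one.
            best = prev[s] if s == 0 else max(prev[s], prev[s - 1])
            if best > neg_inf:
                cur[s] = best + math.log(emissions[t][index[path[s]]])
        prev = cur
    return prev[-1]

def decode(emissions, paths, index):
    """Return the allowed path with the highest alignment score."""
    return max(paths, key=lambda p: alignment_log_score(emissions, p, index))

PHONEME_INDEX = {"k": 0, "ai": 1, "g": 2, "uan": 3}
PATHS = [("k", "ai"), ("g", "uan")]
# Toy emission probabilities: one row per frame, one column per phoneme.
EMISSIONS = [
    [0.10, 0.10, 0.70, 0.10],
    [0.10, 0.10, 0.20, 0.60],
    [0.05, 0.15, 0.10, 0.70],
]

best_path = decode(EMISSIONS, PATHS, PHONEME_INDEX)
```

Restricting the search to the allowed paths is what keeps the cost low enough for a device with limited computing power.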
- a system for constructing a speech model comprising: a target language information acquisition unit for acquiring target language information; a synonymous phoneme sequence group acquisition unit for acquiring a synonymous phoneme sequence group associated with the target language information
- a model training unit for training a language model by using the synonymous phoneme sequence group to obtain a target language model
- a decoding map generating unit for generating a first decoding map according to the target language model, the first decoding map indicating a plurality of decoding paths that satisfy grammar constraint rules determined based on the synonymous phoneme sequence group
- a model determination unit configured to determine the speech recognition model based on the first decoding map.
- the speech model building system according to the seventh aspect may be implemented by, for example, a first computing device with relatively strong computing capability.
- the first computing device may include, for example, a cloud-side or an embedded heavy device, which may have strong computing capabilities for performing the construction of the speech recognition model.
- the first computing device may also include, for example, a user terminal device.
- the system may also be implemented by collaboration between the user terminal and the cloud.
- the obtained speech recognition model may, for example, be deployed to a second computing device.
- the second computing device may comprise an embedded light device, e.g., having less computing power, for performing speech processing with the deployed speech recognition model.
- Examples of the second computing device may include, but are not limited to, smart home devices (eg, air conditioners, refrigerators, washing machines, TVs, speakers, etc.), smart wearable devices (eg, bracelets, watches, glasses, etc.), or in-vehicle devices.
- the embodiments of the present disclosure can construct a speech recognition model with the ability to generalize the phoneme sequence associated with the target language information, so that, for example, a second computing device with less computing power can have phoneme sequence generalization and recognition capability, thereby improving the user interaction experience.
- the target language information may include speech or text.
- the target language information includes speech input from an audio collector located at the user terminal.
- the keywords of the text input are obtained from a text collector at the user terminal.
- the target language information may be some short instruction words or instruction sentences, such as "turn off", "stop", "pause", "increase the volume", and so on.
- the user may, for example, directly provide voice input or text input to the user terminal, so that the user terminal can extract target language information from the voice input or text input in order to construct the speech recognition model.
- the user can input voice or text by using an interface provided by the user terminal, for example.
- Such speech input or text input may be sent to the first computing device so that the first computing device can obtain target language information for the construction of the speech recognition model.
- the user can customize the phoneme sequence that the speech recognition model can support, so that the degree of personalization of the speech recognition model can be improved.
- the synonymous phoneme sequence group acquisition unit may be further configured to: determine the semantics of the target language information; and determine the synonymous phoneme sequence group based on at least the semantics of the target language information, wherein the difference between the semantics of each synonymous phoneme sequence in the synonymous phoneme sequence group and the semantics of the target language information is less than a difference threshold.
- the first computing device can automatically expand the associated synonymous phoneme sequence group based on semantics without relying on the user's input, thereby reducing the user's interaction overhead.
- the synonymous phoneme sequence group acquisition unit may be further configured to: determine a target phoneme sequence corresponding to the target language information; and determine the synonymous phoneme sequence group based on the semantics of the target phoneme sequence and the length of the target phoneme sequence, wherein the difference between the length of each synonymous phoneme sequence in the synonymous phoneme sequence group and the length of the target phoneme sequence is less than a length threshold.
- the length of a phoneme sequence may represent, for example, the number of phonemes (eg, initials and finals) included in the phoneme sequence.
- if the target language information is text, a phoneme sequence corresponding to the text can be obtained as the target phoneme sequence through a pronunciation dictionary.
- if the target language information is speech, the phoneme sequence of the speech can be obtained as the target phoneme sequence through an acoustic model.
- the synonymous phoneme sequences in the synonymous phoneme sequence group can be made to have similar lengths, thereby reducing the complexity of decoding and searching by using the decoding map.
- the synonymous phoneme sequence group obtaining unit may be further configured to: obtain multiple candidate synonyms based on the semantics of the target keyword corresponding to the target language information; provide the multiple candidate synonyms to the user; determine a synonym group from the multiple candidate synonyms based on user input received from the user, the user input indicating that at least one candidate synonym of the multiple candidate synonyms is excluded or confirmed; and obtain a synonymous phoneme sequence group based on the pronunciation dictionary and the synonym group.
- the synonymous phoneme sequence group used for training the speech recognition model can be further adjusted based on user feedback, which can make the obtained speech recognition model more suitable for the user's usage habits.
- the synonymous phoneme sequence group acquisition unit may be further configured to: receive speech input from a user; and generate a synonymous phoneme sequence group based on the speech input. For example, the semantics of the speech input is obtained based on the keywords corresponding to the speech input, so as to generate the synonymous phoneme sequence group.
- the synonymous phoneme sequence group used for training the speech recognition model can be supplemented based on the user feedback in the form of speech input, which can make the obtained speech recognition model more suitable for the user's usage habits.
- the target language information includes at least first language information and second language information
- the model determining unit may be further configured to: obtain a first set of decoding paths and a second set of decoding paths from the first decoding map, the first set of decoding paths including decoding paths for the first synonymous phoneme sequence group associated with the first language information, and the second set of decoding paths including decoding paths for the second synonymous phoneme sequence group associated with the second language information; generate a first subgraph based on the first set of decoding paths; generate a second subgraph based on the second set of decoding paths; and determine a speech recognition model based on at least the first subgraph and the second subgraph. Based on this method, the generated decoding map has lower complexity and can support faster decoding search, thereby reducing computational overhead and storage overhead.
- the first subgraph indicates a first decoding path and a second decoding path
- the first decoding path and the second decoding path are decoding paths in the first synonymous phoneme sequence group
- the first decoding path and the second decoding path have the same weight in the first subgraph.
- the target language information acquisition unit may be further configured to: acquire the first language information group according to the pre-stored historical language information and the received language information; and, in response to determining that the number of pieces of language information in the first language information group exceeds a predetermined threshold, acquire the target language information from the first language information group based on the predetermined threshold.
- the target language information obtaining unit may be further configured to: obtain target language information from the first language information group according to attributes of the language information, where the quantity of the target language information is the predetermined threshold. For example, the one or more pieces of historical language information created earliest may be deleted from the first language information group, thereby obtaining the predetermined threshold number of pieces of language information.
- the target language information acquisition unit may be further configured to: acquire target language information from the first language information group according to a user instruction, where the amount of target language information is a predetermined threshold. For example, which language information in the first language information group to retain as the target language information may be selected according to user input.
- the first computing device may further instruct to provide the speech recognition model to the target computing device (eg, the second computing device) for deployment of the speech recognition model on the target computing device. Based on this approach, automatic deployment of speech recognition models can be supported.
- a speech processing system comprising: a speech instruction input unit for receiving a speech instruction input; a speech processing unit for obtaining a phoneme sequence representation of the speech instruction input by using a speech recognition model,
- the speech recognition model being configured to recognize the speech instruction input based on a phoneme sequence group whose phoneme sequences are semantically synonymous for an instruction; and for executing the instruction corresponding to the phoneme sequence representation if the phoneme sequence representation corresponds to a phoneme sequence in the phoneme sequence group.
- the speech recognition model may be obtained by the first computing device.
- the first computing device may include, for example, a cloud-side or an embedded heavy device, which may have strong computing capabilities for performing the construction of the speech recognition model.
- the first computing device may also include, for example, a user terminal device.
- the speech processing system may be implemented by a second computing device, for example.
- the second computing device may comprise an embedded light device, eg, having less computing power, for performing speech processing with the deployed speech recognition model.
- Examples of the second computing device may include, but are not limited to, smart home devices (eg, air conditioners, refrigerators, washing machines, TVs, speakers, etc.), smart wearable devices (eg, wristbands, watches, glasses, etc.), or in-vehicle devices.
- the embodiments of the present disclosure can enable, for example, a second computing device with relatively small computing power to have the capability of generalizing and recognizing phoneme sequences, without the need for natural language understanding by recognizing keywords, thereby reducing the performance requirements of the device while also improving the user's voice interaction experience.
- the speech recognition model is obtained based on the following process: obtaining target language information; obtaining a synonymous phoneme sequence group associated with the target language information, the synonymous phoneme sequence group including at least one synonymous phoneme sequence, the synonymous phoneme sequence being the phoneme sequence corresponding to words and sentences semantically similar to the target language information; training a language model by using the synonymous phoneme sequence group to obtain a target language model; generating a first decoding map according to the target language model, the first decoding map indicating a plurality of decoding paths that satisfy the grammar constraint rules determined based on the synonymous phoneme sequence group; and determining the speech recognition model based on the first decoding map.
- the constructed speech recognition model can realize the generalized recognition ability of customized target language information.
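- As a rough illustration of the decoding map described above, the sketch below builds a toy graph whose only valid paths are the phoneme sequences of one synonymous group. It is a simplification under stated assumptions: language model training is skipped, each pinyin syllable is treated as one unit, and the names `DecodingNode`, `build_decoding_graph`, and `follow_path` are hypothetical.

```python
from typing import Dict, List, Optional, Sequence


class DecodingNode:
    """One state in a toy decoding graph; outgoing edges are labeled with phonemes."""

    def __init__(self) -> None:
        self.children: Dict[str, "DecodingNode"] = {}
        self.group: Optional[str] = None  # set only on accepting states


def build_decoding_graph(groups: Dict[str, List[Sequence[str]]]) -> DecodingNode:
    """Insert every phoneme sequence of every group as one path from the root."""
    root = DecodingNode()
    for group_name, sequences in groups.items():
        for sequence in sequences:
            node = root
            for phoneme in sequence:
                node = node.children.setdefault(phoneme, DecodingNode())
            node.group = group_name  # the end of a full sequence is accepting
    return root


def follow_path(root: DecodingNode, phonemes: Sequence[str]) -> Optional[str]:
    """Return the group name if the phonemes form a complete path, else None."""
    node = root
    for phoneme in phonemes:
        node = node.children.get(phoneme)
        if node is None:
            return None
    return node.group


# Assumed example group: the two sequences come from the description,
# the group label "volume_up" is illustrative only.
GROUPS = {
    "volume_up": [
        ["ti", "gao", "sheng", "yin"],
        ["ti", "sheng", "yin", "liang"],
    ]
}
graph = build_decoding_graph(GROUPS)
```

Sequences outside the group, and incomplete prefixes of a valid sequence, do not reach an accepting state, which is one simple way to read "decoding paths that satisfy the grammar constraint rules".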
- the speech processing unit may be further configured to provide a notification of no recognition result if the phoneme sequence representation does not match any phoneme sequence in the phoneme sequence group. Based on this method, the user's voice can be recognized in real time and efficiently, which improves the user's voice interaction experience.
- the speech recognition model is configured to recognize a first phoneme sequence group whose sequences share a first semantic meaning and a second phoneme sequence group whose sequences share a second semantic meaning.
- the speech processing unit may also be configured to execute a first instruction if the phoneme sequence representation corresponds to a first phoneme sequence in the first phoneme sequence group, and to execute a second instruction, different from the first instruction, if the phoneme sequence representation corresponds to a second phoneme sequence in the second phoneme sequence group. In this way, the speech recognition model can recognize phoneme sequences in groups with different semantics, so that the instruction corresponding to the user's intention can be executed.
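- The group-based dispatch described above, together with the "no recognition result" notification, can be sketched as follows. This is only an illustration: `make_dispatcher`, the group names, the returned strings, and the pinyin "jing yin" for "mute" are assumptions introduced here.

```python
from typing import Callable, Dict, List, Tuple

PhonemeSeq = Tuple[str, ...]


def make_dispatcher(
    groups: Dict[str, List[PhonemeSeq]],
    instructions: Dict[str, Callable[[], str]],
) -> Callable[[PhonemeSeq], str]:
    """Build a function that maps a recognized phoneme sequence to an instruction."""
    # Reverse index: phoneme sequence -> name of the group it belongs to.
    lookup = {seq: name for name, seqs in groups.items() for seq in seqs}

    def dispatch(recognized: PhonemeSeq) -> str:
        name = lookup.get(recognized)
        if name is None:
            # No sequence in any group matched.
            return "no recognition result"
        return instructions[name]()

    return dispatch


# Hypothetical groups and instructions, for illustration only.
GROUPS = {
    "first": [("ti", "gao", "sheng", "yin"), ("ti", "sheng", "yin", "liang")],
    "second": [("jing", "yin")],
}
INSTRUCTIONS = {
    "first": lambda: "volume increased",
    "second": lambda: "muted",
}
dispatch = make_dispatcher(GROUPS, INSTRUCTIONS)
```

Any sequence in the first group triggers the same first instruction, while a sequence in the second group triggers a different one.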
- the speech processing unit may be further configured to: use an acoustic model to generate emission probabilities from the speech features of the speech instruction input to phonemes; recognize the speech instruction input by feeding the emission probabilities into the speech recognition model; and have the speech recognition model output a phoneme sequence representation. In this way, the corresponding phoneme sequence can be obtained from an instruction in speech form and matched against the phoneme sequences in the phoneme sequence group that the speech recognition model can recognize.
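- To make the role of emission probabilities concrete, the sketch below scores each candidate phoneme sequence against per-frame emission probabilities using a monotonic alignment (a small Viterbi-style dynamic program) and returns the best candidate. It is a toy under stated assumptions: the emission values are hand-written rather than produced by an acoustic model, the candidate list stands in for the decoding paths, and `alignment_score` and `decode` are hypothetical names.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple

NEG_INF = float("-inf")


def _log(p: float) -> float:
    return math.log(p) if p > 0.0 else NEG_INF


def alignment_score(emissions: List[Dict[str, float]], seq: Sequence[str]) -> float:
    """Best log-probability of assigning every frame to one phoneme of seq,
    in order, with each phoneme covering at least one frame."""
    n = len(seq)
    if n == 0 or len(emissions) < n:
        return NEG_INF
    # best[j]: best log-probability of a path that is in phoneme j at this frame.
    best = [NEG_INF] * n
    best[0] = _log(emissions[0].get(seq[0], 0.0))
    for frame in emissions[1:]:
        new = [NEG_INF] * n
        for j in range(n):
            stay = best[j]
            advance = best[j - 1] if j > 0 else NEG_INF
            prev = max(stay, advance)
            if prev > NEG_INF:
                new[j] = prev + _log(frame.get(seq[j], 0.0))
        best = new
    return best[n - 1]


def decode(
    emissions: List[Dict[str, float]], candidates: List[Sequence[str]]
) -> Optional[Tuple[str, ...]]:
    """Return the candidate with the highest alignment score, or None."""
    best_seq, best_score = None, NEG_INF
    for seq in candidates:
        score = alignment_score(emissions, seq)
        if score > best_score:
            best_seq, best_score = tuple(seq), score
    return best_seq


# Hand-written emission probabilities for three frames (illustrative only).
EMISSIONS = [
    {"ti": 0.9, "sheng": 0.05, "gao": 0.05},
    {"ti": 0.1, "sheng": 0.7, "gao": 0.2},
    {"ti": 0.1, "sheng": 0.8, "gao": 0.1},
]
CANDIDATES = [("ti", "sheng"), ("ti", "gao")]
result = decode(EMISSIONS, CANDIDATES)
```

A practical decoder would search a graph rather than enumerate candidates, but the scoring idea is the same: frame-level probabilities are combined along a path that respects the allowed phoneme order.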
- a first computing device includes: at least one computing unit; and at least one memory coupled to the at least one computing unit and storing instructions for execution by the at least one computing unit, the instructions, when executed by the at least one computing unit, causing the first computing device to perform some or all of the steps of the method in the first aspect or any implementation of the first aspect, or some or all of the steps of the method in the fifth aspect or any implementation of the fifth aspect.
- a second computing device includes: at least one computing unit; and at least one memory coupled to the at least one computing unit and storing instructions for execution by the at least one computing unit, the instructions, when executed by the at least one computing unit, causing the second computing device to perform some or all of the steps of the method in the second aspect or any implementation of the second aspect, or some or all of the steps of the method in the sixth aspect or any implementation of the sixth aspect.
- a computer-readable storage medium having one or more computer instructions stored thereon, wherein the one or more computer instructions are executed by a processor to perform some or all of the steps of the method in the first aspect or any implementation of the first aspect, or some or all of the steps of the method in the fifth aspect or any implementation of the fifth aspect.
- a computer-readable storage medium having one or more computer instructions stored thereon, wherein the one or more computer instructions are executed by a processor to perform some or all of the steps of the method in the second aspect or any implementation of the second aspect, or some or all of the steps of the method in the sixth aspect or any implementation of the sixth aspect.
- a computer program product that, when run on a computer, causes the computer to perform some or all of the steps of the method in the first aspect or any implementation of the first aspect, or some or all of the steps of the method in the fifth aspect or any implementation of the fifth aspect.
- in a fourteenth aspect of the present disclosure, there is provided a computer program product that, when run on a computer, causes the computer to perform some or all of the steps of the method in the second aspect or any implementation of the second aspect, or some or all of the steps of the method in the sixth aspect or any implementation of the sixth aspect.
- the first computing device of the ninth aspect, the computer storage medium of the eleventh aspect, or the computer program product of the thirteenth aspect provided above are all used to execute the method provided by the first aspect. Therefore, the explanations or descriptions about the first aspect are also applicable to the ninth aspect, the eleventh aspect and the thirteenth aspect.
- for the beneficial effects that can be achieved by the ninth, eleventh, and thirteenth aspects, reference may be made to the beneficial effects of the corresponding methods, which are not repeated here.
- the second computing device of the tenth aspect, the computer storage medium of the twelfth aspect, or the computer program product of the fourteenth aspect provided above are all used to execute the method provided by the second aspect. Therefore, the explanations or descriptions about the second aspect are also applicable to the tenth, twelfth and fourteenth aspects.
- for the beneficial effects that can be achieved by the tenth, twelfth, and fourteenth aspects, reference may be made to the beneficial effects of the corresponding methods, which are not repeated here.
- FIG. 1 shows a schematic block diagram of an example environment in which embodiments of the present disclosure may be implemented
- FIGS. 2A-2D illustrate example user interfaces in accordance with some embodiments of the present disclosure
- FIG. 3 illustrates a schematic block diagram of another example environment in which embodiments of the present disclosure may be implemented
- FIG. 4 illustrates a schematic block diagram of yet another example environment in which embodiments of the present disclosure may be implemented
- FIG. 5 shows a flowchart of a process for building a speech recognition model according to some embodiments of the present disclosure
- FIG. 6 shows a flowchart of an example process for obtaining target keywords according to some embodiments of the present disclosure
- FIG. 7 illustrates a flowchart of an example process for determining synonym groups in accordance with some embodiments of the present disclosure
- FIG. 8 shows a schematic diagram of an example process for training a language model according to some embodiments of the present disclosure
- FIG. 9 shows a schematic diagram of generating a decoding map according to some embodiments of the present disclosure.
- FIG. 10 shows a schematic diagram of sub-graph clustering of a decoding graph according to some embodiments of the present disclosure
- FIG. 11 shows a schematic diagram of an example subgraph according to some embodiments of the present disclosure.
- FIG. 12 shows a flowchart of an example process for speech processing in accordance with some embodiments of the present disclosure
- FIG. 13 shows a flowchart of an example process for determining a synonymous phoneme sequence group according to an embodiment of the present disclosure
- FIG. 14 shows a schematic diagram of an example process for generating a decoding map according to an embodiment of the present disclosure
- FIG. 15 shows a schematic diagram of an example process for synonymous phoneme sequence clustering according to an embodiment of the present disclosure
- FIG. 16 shows a schematic diagram of an example subgraph according to some embodiments of the present disclosure
- FIG. 17 shows a flowchart of an example process for speech processing in accordance with some embodiments of the present disclosure
- FIG. 18 shows a flowchart of an example process for determining a speech recognition result according to an embodiment of the present disclosure
- FIG. 19 shows a flowchart of an example process of a speech processing method according to an embodiment of the present disclosure
- FIG. 20 shows a schematic block diagram of an example speech recognition system in accordance with some specific embodiments of the present disclosure
- FIG. 21 shows a schematic block diagram of a speech model building system according to an embodiment of the present disclosure
- FIG. 22 shows a block diagram of a speech processing system according to an embodiment of the present disclosure
- FIG. 23 shows a block diagram of a speech model building system according to an embodiment of the present disclosure
- FIG. 24 shows a block diagram of a speech processing system according to an embodiment of the present disclosure
- FIG. 25 shows a schematic block diagram of an example device that may be used to implement embodiments of the present disclosure.
- the term "comprising" and the like should be understood as open-ended inclusion, i.e., "including but not limited to".
- the term “based on” should be understood as “based at least in part on”.
- the terms “one embodiment” or “the embodiment” should be understood to mean “at least one embodiment”.
- the terms “first”, “second”, etc. may refer to different or the same objects. Other explicit and implicit definitions may also be included below.
- some traditional speech recognition schemes often rely on other devices with stronger computing power (e.g., mobile phones or cloud servers) to implement semantic understanding of speech input.
- some light devices can also deploy lightweight speech recognition models to achieve local semantic understanding.
- such speech recognition models cannot support user-defined semantics and cannot handle different speech inputs that share the same semantics, which greatly degrades the user's voice interaction experience.
- Embodiments of the present disclosure provide a keyword-based speech recognition model and a phoneme sequence-based speech recognition model.
- the keyword may be a textual representation of the language
- the phoneme sequence may be a sequence of phonetic units divided according to the natural attributes of the phoneme.
- the phonemes may be, for example, consonants, phonetic symbols, or any other form.
- FIG. 1 shows a schematic diagram of an example environment 100 in which various embodiments of the present disclosure can be implemented.
- the environment 100 includes a first computing device 130.
- the first computing device 130 may be, for example, a device with relatively strong computing capabilities, examples of which include, but are not limited to, cloud-side servers, smart phones, laptop computers, tablet computers, desktop computers, or edge computing devices.
- a first computing device 130 may also be referred to as a heavy device.
- such a first computing device 130 may include, for example, a cloud-side device.
- the first computing device 130 may acquire a target keyword, train a language model based on a synonym group semantically associated with the target keyword, and further obtain a speech recognition model.
- the target keywords may include, for example, keywords entered by the user 105 .
- user 105 may, for example, utilize user terminal 110 to configure keywords.
- the user 105 desires to configure the second computing device 150 to enable speech recognition of the keyword "raise the voice".
- the first computing device 130 may also obtain target language information, train a language model based on a synonymous phoneme sequence group associated with the target language information, and further obtain a speech recognition model.
- the target phoneme sequence may be obtained, for example, from speech 115 provided by the user 105.
- user 105 may, for example, utilize user terminal 110 to configure groups of synonymous phoneme sequences.
- the user 105 desires, for example, to configure the second computing device 150 to enable speech recognition of the phoneme sequence corresponding to "raise the voice".
- Although the second computing device 150 is shown as a smart speaker, it should be understood that the smart speaker is only an example of the second computing device 150, which may also include other suitable devices, for example, smart home devices (e.g., air conditioners, refrigerators, washing machines, TVs, etc.), smart wearable devices (e.g., wristbands, watches, glasses, etc.), or in-vehicle devices.
- the second computing device 150 typically has less computing power and is also referred to as a light device in this disclosure. The present disclosure is not intended to limit the specific form of the second computing device 150 .
- the user 105 may, for example, enter the user interface 200A for configuring the smart home through the user terminal 110 , and may, for example, select “speaker in the living room” to configure the smart speaker. It should be understood that, for example, the user 105 may also select other suitable smart homes for configuration. For illustrative purposes only, "Living Room Speakers" are used in the scenarios of Figures 1 and 2 as an example.
- the user terminal 110 may, for example, provide the user 105 with a user interface 200B for configuring custom keywords, so that the user 105 can modify, add or delete keywords that the smart speaker can recognize.
- the smart speaker has, for example, already been configured with three custom keywords, "reduce sound", "mute", and "switch signal source", previously created by the user 105.
- the user 105 can view synonyms supported by the smart speaker, for example. Taking "reduce the sound" as an example, the user 105 can click "view synonyms" to view the synonyms of "reduce the sound" supported by the smart speaker, such as "reduce the sound" and "turn down the sound", as shown in FIG. 2C.
- the user 105 can also edit the synonym groups supported by the smart speaker, for example, to delete synonyms or modify synonyms.
- the user 105 may also configure new keywords, for example.
- the user 105 may enter the text "raise the voice", e.g., through a touch screen provided by the user terminal 110.
- the user 105 can also directly input the voice corresponding to "raise the voice” by clicking the microphone button, for example.
- the user 105 can also customize the execution action corresponding to the keyword, that is, what operation the smart speaker should perform when recognizing the keyword.
- the smart speaker can, for example, perform the action of "increase the volume of the speaker" after recognizing the keyword "raise the voice".
- the keywords obtained in the above manner and other keywords with the same semantics can be converted into corresponding phoneme sequences by using a pronunciation dictionary, so as to obtain a synonymous phoneme sequence group.
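- A minimal sketch of this conversion is given below. It assumes a toy phrase-level pronunciation dictionary keyed by the English glosses used in this description; `PRONUNCIATION_DICT` and `to_synonymous_phoneme_group` are hypothetical names, and a real pronunciation dictionary would normally map individual words or characters to phonemes.

```python
from typing import Dict, Iterable, List

# Toy phrase-level pronunciation dictionary (an assumption for illustration).
# The phoneme strings are the ones given in this description.
PRONUNCIATION_DICT: Dict[str, List[str]] = {
    "raise the voice": ["ti", "gao", "sheng", "yin"],
    "raise volume": ["ti", "sheng", "yin", "liang"],
}


def to_synonymous_phoneme_group(
    keywords: Iterable[str], lexicon: Dict[str, List[str]]
) -> List[List[str]]:
    """Convert a keyword and its synonyms into a group of phoneme sequences."""
    group: List[List[str]] = []
    for keyword in keywords:
        phonemes = lexicon.get(keyword.strip().lower())
        if phonemes is None:
            raise KeyError(f"no pronunciation for {keyword!r}")
        if phonemes not in group:  # skip duplicate pronunciations
            group.append(list(phonemes))
    return group


group = to_synonymous_phoneme_group(
    ["Raise the voice", "raise volume"], PRONUNCIATION_DICT
)
```

Raising an error on an unknown keyword is one possible design choice; a system could instead fall back to a grapheme-to-phoneme model, which is outside the scope of this sketch.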
- the user terminal 110 may provide the received voice input 115 or text input 120 to a first computing device 130, e.g., a cloud-side server, for example via wireless communication.
- the first computing device 130 may obtain the user-entered keyword "raise the voice" or the corresponding phoneme sequence (e.g., through a pronunciation dictionary) from the speech input 115 or the text input 120, and thereby determine the target keyword or corresponding target phoneme sequence used for training the language model.
- the target keyword may include, for example, default system keywords, keywords previously customized by the user, and keywords that the user desires to add.
- the target keyword may, for example, only include keywords associated with the user's personalized customization, and the default system keywords may no longer be repeatedly acquired in response to the user's customization operation, so that the generated speech recognition model 140 is only used to support speech recognition of user-defined keywords.
- the target phoneme sequence may include only the phoneme sequence associated with the user's personalization.
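- The two ways of assembling the target keywords described above (all sources, or user customizations only) might be sketched as follows. The function `collect_target_keywords`, its parameters, and the default keyword "play music" are hypothetical; the custom keywords are the ones mentioned in this description.

```python
from typing import Iterable, List


def collect_target_keywords(
    new_custom: Iterable[str],
    previous_custom: Iterable[str] = (),
    system_default: Iterable[str] = (),
    custom_only: bool = False,
) -> List[str]:
    """Merge keyword sources in a stable order, dropping duplicates."""
    sources = [previous_custom, new_custom]
    if not custom_only:
        # Include the default system keywords ahead of the custom ones.
        sources.insert(0, system_default)
    seen = set()
    targets: List[str] = []
    for source in sources:
        for keyword in source:
            key = keyword.strip().lower()
            if key and key not in seen:
                seen.add(key)
                targets.append(key)
    return targets


SYSTEM_DEFAULT = ["play music"]  # hypothetical default keyword
PREVIOUS_CUSTOM = ["reduce sound", "mute", "switch signal source"]
NEW_CUSTOM = ["raise the voice"]

all_targets = collect_target_keywords(NEW_CUSTOM, PREVIOUS_CUSTOM, SYSTEM_DEFAULT)
custom_targets = collect_target_keywords(
    NEW_CUSTOM, PREVIOUS_CUSTOM, SYSTEM_DEFAULT, custom_only=True
)
```

With `custom_only=True`, the resulting model would cover only user-defined keywords, matching the second variant above.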
- the first computing device 130 may determine a synonym group semantically associated with the target keyword, train a language model using the target keyword and the synonym group, and further obtain a speech recognition model 140 .
- the first computing device 130 may determine a synonymous phoneme sequence group associated with the target language information, train a language model using the synonymous phoneme sequence group, and further obtain a speech recognition model 140 .
- the process of obtaining the speech recognition model 140 will be described in detail below with reference to FIG. 2 , and will not be described in detail here.
- the speech recognition model 140 obtained using the target keywords and synonyms may be deployed to smart speakers designated by the user 105 . After the smart speaker is deployed with the speech recognition model 140, it will be able to support speech recognition for target keywords and associated synonym groups.
- user 155 (which may or may not be the same user as user 105) may provide voice input 160 to the smart speaker, such as "turn up the volume.” Accordingly, the smart speaker can utilize the speech recognition model 140 to process the speech input 160 and determine a textual representation 170 corresponding to the speech input 160, ie "turn up the volume".
- text representation 170 may be text corresponding to the entire speech segment of speech input 160 , or text representation 170 may be text corresponding to a partial speech segment of speech input 160 .
- the user 155 may input "please increase the volume” by voice, and accordingly, the smart speaker may use the speech recognition model 140 to recognize the keyword “increase the volume” included therein.
- the voice input 160 received by the smart speaker may correspond to a custom keyword configured by the user 105, such as "raise the voice", or may correspond to a synonym automatically determined by the first computing device 130, such as "increase volume".
- the smart speaker may perform the action corresponding to the textual representation 170 based on preconfigured rules. For example, the user 105 has previously configured the corresponding execution action as "increase speaker volume”. This allows the smart speaker to perform the action of "increasing the volume of the speaker” after recognizing the keyword "increase the volume”.
- the textual representation 170 may also trigger another device different from the smart speaker to perform a corresponding action, for example.
- the user can also configure the corresponding execution action as "increase the volume of the TV in the living room".
- when the smart speaker recognizes "volume up", it can send a command to the "living room TV" to increase the volume of the TV.
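- The preconfigured rules described above, including the case where the action targets a different device, can be sketched as a lookup from a recognized keyword or synonym to an action. The `Action` type, `resolve_action`, and the device and command strings are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set, Tuple


@dataclass(frozen=True)
class Action:
    target_device: str  # device that should carry out the command
    command: str


def resolve_action(
    recognized: str,
    synonym_groups: Dict[str, Set[str]],
    rules: Dict[str, Action],
    local_device: str,
) -> Optional[Tuple[str, Action]]:
    """Map recognized text to ("local" | "remote", action), or None if unknown."""
    for keyword, synonyms in synonym_groups.items():
        if recognized == keyword or recognized in synonyms:
            action = rules.get(keyword)
            if action is None:
                return None
            mode = "local" if action.target_device == local_device else "remote"
            return mode, action
    return None


# Hypothetical configuration: the custom keyword, its synonyms, and a rule
# that routes the action to the living room TV instead of the speaker.
SYNONYM_GROUPS = {"raise the voice": {"increase volume", "turn up the volume"}}
RULES = {"raise the voice": Action("living room TV", "increase volume")}

resolved = resolve_action(
    "turn up the volume", SYNONYM_GROUPS, RULES, local_device="living room speaker"
)
```

Because the rule is keyed by the custom keyword, every synonym in the group resolves to the same action without the user configuring each one.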
- the speech recognition model 140 obtained using sets of synonymous phoneme sequences may also be deployed to smart speakers designated by the user 105 . After the smart speaker is deployed with the speech recognition model 140, it will be able to support speech recognition for groups of synonymous phoneme sequences.
- user 155 (which may or may not be the same user as user 105) may provide voice input 160 to the smart speaker, such as "turn up the volume.” Accordingly, the smart speaker can utilize the speech recognition model 140 to process the speech input 160 and determine the phoneme sequence representation 180 of the user's speech input 160, ie, "ti sheng yin liang".
- the phoneme sequence representation 180 may be the phoneme sequence corresponding to the entire speech segment of the speech input 160, or the phoneme sequence representation 180 may be the phoneme sequence corresponding to a part of the speech segment of the speech input 160.
- the user 155 can input "qing ti sheng yin liang" by voice (please raise the volume), and accordingly, the smart speaker can use the voice recognition model 140 to recognize the phoneme sequence "ti sheng yin liang" included therein.
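- Recognizing a known phoneme sequence inside a longer utterance, as in the example above, can be illustrated with a simple contiguous-subsequence search. This sketch operates on already-recognized phonemes and is not how a decoding graph would do it; `find_known_sequence` is a hypothetical name.

```python
from typing import List, Optional, Sequence, Tuple


def find_known_sequence(
    utterance: Sequence[str], known: List[Sequence[str]]
) -> Optional[Tuple[int, Tuple[str, ...]]]:
    """Return (start index, sequence) of the longest known phoneme sequence
    that appears contiguously in the utterance, or None if there is none."""
    best: Optional[Tuple[int, Tuple[str, ...]]] = None
    for sequence in known:
        n = len(sequence)
        for start in range(len(utterance) - n + 1):
            if tuple(utterance[start:start + n]) == tuple(sequence):
                if best is None or n > len(best[1]):
                    best = (start, tuple(sequence))
                break  # first occurrence of this sequence is enough
    return best


KNOWN = [
    ["ti", "gao", "sheng", "yin"],
    ["ti", "sheng", "yin", "liang"],
]
match = find_known_sequence("qing ti sheng yin liang".split(), KNOWN)
```

Here the leading "qing" (please) is skipped and the remaining part of the utterance matches a sequence in the group.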
- the voice input 160 received by the smart speaker may correspond to a custom phoneme sequence configured by the user 105, such as "ti gao sheng yin" (raise the voice), or may correspond to an automatically determined synonymous phoneme sequence, such as "ti sheng yin liang" (raise volume).
- the smart speaker may perform the action corresponding to the phoneme sequence representation 180 based on preconfigured rules. For example, the user 105 has previously configured the corresponding execution action as "increase speaker volume". This allows the smart speaker to perform the action of "increasing the volume of the speaker" after recognizing the phoneme sequence representation "ti sheng yin liang".
- the phoneme sequence representation 180 may also trigger another device different from the smart speaker to perform a corresponding action, for example.
- the user can also configure the corresponding execution action as "increase the volume of the TV in the living room".
- when the smart speaker recognizes "ti sheng yin liang", it can send a command to the "living room TV" to increase the volume of the TV.
- FIG. 3 shows a schematic diagram of an example environment 300 in which various embodiments of the present disclosure can be implemented.
- environment 300 includes a first computing device 330 .
- a first computing device 330 may include, for example, a cloud-side device.
- the user 305 may configure custom keywords that the second computing device 310 is expected to be able to voice recognize, for example, by providing voice input 320 directly to the second computing device 310 to be configured.
- Although the second computing device 310 is shown as a smart speaker, it should be understood that the smart speaker is only an example of the second computing device 310, which may also include other suitable devices, for example, smart home devices (e.g., air conditioners, refrigerators, washing machines, TVs, etc.), smart wearable devices (e.g., wristbands, watches, glasses, etc.), or in-vehicle devices.
- the speech input 320 may be sent to the first computing device 330, e.g., via a wired or wireless network.
- the first computing device 330 may, for example, utilize suitable speech recognition techniques and pre-specified grammar rules to extract the keyword "raise the voice" or the phoneme sequence representation "ti gao sheng yin" from the speech input 320.
- the first computing device 330 may acquire the target keyword including "raise the voice", and may further acquire the synonym group 335 semantically associated with the target keyword.
- the first computing device 330 may further utilize the target keyword and the synonym group 335 to obtain the speech recognition model 340.
- the first computing device 330 may also obtain the target language information including "ti gao sheng yin", and may further obtain a synonymous phoneme sequence group (not shown in the figure) associated with the target language information.
- the first computing device 330 may further utilize the set of synonymous phoneme sequences to obtain a speech recognition model 340 .
- the speech recognition model 340 may be further deployed to the smart speaker, so that the smart speaker can perform speech recognition of the keyword "raise the voice" defined by the user 305 and of the corresponding synonyms.
- user 355 (which may or may not be the same user as user 305) may provide voice input 360 to the smart speaker, e.g., "turn up the volume".
- the smart speaker can utilize the speech recognition model 340 to process the speech input 360, and determine the textual representation 370 corresponding to the speech input 360, namely "increase the volume", or determine the phoneme sequence representation 380 corresponding to the speech input 360, namely "ti sheng yin liang".
- text representation 370 may be text corresponding to the entire speech segment of speech input 360 , or text representation 370 may be text corresponding to a partial speech segment of speech input 360 .
- the user 355 may input "please increase the volume” by voice, and accordingly, the smart speaker may use the speech recognition model 340 to recognize the keyword "increase the volume” included therein.
- the voice input 360 received by the smart speaker may correspond to a custom keyword configured by the user 305, such as "raise the voice", or may correspond to a synonym automatically determined by the first computing device 330, such as "Increase volume".
- the smart speaker may also perform the action corresponding to the textual representation 370 based on preconfigured rules.
- phoneme sequence representation 380 may be a phoneme sequence corresponding to the entire speech segment of speech input 360, or phoneme sequence representation 380 may be a phoneme sequence corresponding to a partial speech segment of speech input 360.
- the user 355 can input "qing ti sheng yin liang" (please raise the volume) by voice, and accordingly, the smart speaker can use the speech recognition model 340 to recognize the included phoneme sequence "ti sheng yin liang" (raise the volume).
- the voice input 360 received by the smart speaker may correspond to a custom phoneme sequence configured by the user 305, such as "ti gao sheng yin" (raise the voice), or may correspond to a synonymous phoneme sequence automatically determined by the first computing device 330, such as "ti sheng yin liang" (raise volume).
- the smart speaker may also perform actions corresponding to the phoneme sequence representation 380 based on preconfigured rules.
- FIG. 4 shows a schematic diagram of yet another example environment 400 in which various embodiments of the present disclosure can be implemented.
- environment 400 includes a first computing device 430 .
- the first computing device 430 may be, for example, a user terminal, such as a smart phone or a tablet computer.
- the user 405 can configure the second computing device 450 , for example, through an interface provided by the user terminal.
- the second computing device 450 is shown as a smart TV, but it should be understood that the smart TV is only one example of the second computing device 450, which may also include other suitable devices, for example, smart home devices (e.g., air conditioners, refrigerators, washing machines, speakers, etc.), smart wearable devices (e.g., wristbands, watches, glasses, etc.), or in-vehicle devices.
- the keyword "raise the voice" or the corresponding phoneme sequence can be determined directly by the user terminal from the speech input 410 or the text input 420, without the need to send the speech input 410 or the text input 420 to the cloud-side device.
- the user terminal may also acquire target keywords. Such target keywords may include, for example, keywords determined based on the voice input 410 or the text input 420, and may also include keywords previously defined by the user.
- the user terminal may further obtain a synonym group 435 semantically associated with the target keyword, and obtain a speech recognition model 440 based on a process similar to that of FIG. 1 .
- the user terminal can also acquire target language information (eg, text or voice).
- target language information may include, for example, a phoneme sequence determined based on the speech input 410 or the text input 420, and may also include a phoneme sequence previously defined by the user, and the like.
- the user terminal may further obtain a synonymous phoneme sequence group (not shown in the figure) associated with the target language information, and obtain a speech recognition model 440 based on a process similar to that of FIG. 1.
- the speech recognition model 440 obtained using the target keywords and the synonym group 435 may be deployed to a smart TV specified by the user 405.
- after the smart TV is deployed with the speech recognition model 440, it will be able to support semantic recognition of the target keyword and the associated synonym group 435.
- user 455 (which may or may not be the same user as user 405) may provide voice input 460 to the smart television, such as "turn up the volume.”
- the smart TV can utilize the speech recognition model 440 to process the speech input 460 and determine a textual representation 470 corresponding to the speech input 460, ie "turn up the volume".
- text representation 470 may be text corresponding to the entire speech segment of speech input 460 , or text representation 470 may be text corresponding to a partial speech segment of speech input 460 .
- the user 455 may input "please increase the volume” by voice, and accordingly, the smart TV may use the speech recognition model 440 to recognize the keyword "increase the volume” included therein.
- the voice input 460 received by the smart TV may correspond to a custom keyword configured by the user 405, such as "raise the voice”, or may correspond to a synonym automatically determined by the first computing device 430, such as "Increase volume”.
- the smart TV may perform the action corresponding to the textual representation 470 based on preconfigured rules. For example, the user 405 has previously configured the corresponding execution action to be “increase the TV volume”. This enables the smart TV to perform the action of "increase the volume of the TV” after recognizing the keyword "increase the volume”.
- the textual representation 470 may also trigger another device other than the smart TV to perform a corresponding action, for example.
- the user can also configure the corresponding execution action as "increase the volume of the speakers in the living room".
- the smart TV can send a command to the "living room speaker” to increase the volume of the speaker after recognizing the "increase volume”.
- the speech recognition model 440 obtained using the synonymous phoneme sequence set may be deployed to a smart TV specified by the user 405 .
- after the smart TV is deployed with the speech recognition model 440, it will be able to support semantic recognition for groups of synonymous phoneme sequences.
- user 455 (which may or may not be the same user as user 405) may provide voice input 460, such as "ti sheng yin liang" (volume up), to the smart TV.
- the smart TV can process the voice input 460 using the voice recognition model 440, and determine the phoneme sequence representation 480 corresponding to the voice input 460, namely "ti sheng yin liang".
- the phoneme sequence representation 480 may be the phoneme sequence corresponding to the entire speech segment of the speech input 460, or the phoneme sequence representation 480 may be the phoneme sequence corresponding to a partial speech segment of the speech input 460.
- the user 455 can input "qing ti sheng yin liang" by voice (please raise the volume), and accordingly, the smart TV can use the voice recognition model 440 to recognize the phoneme sequence "ti sheng yin liang" included therein.
- the voice input 460 received by the smart TV may correspond to a custom phoneme sequence configured by the user 405, such as "ti gao sheng yin" (raise the voice), or may correspond to a synonymous phoneme sequence automatically determined by the first computing device 430, such as "ti sheng yin liang" (raise volume).
- the smart TV may perform the action corresponding to the phoneme sequence representation 480 based on preconfigured rules. For example, the user 405 has previously configured the corresponding execution action to be "increase the TV volume”. This can make the smart TV perform the action of "increase the TV volume” after recognizing the phoneme sequence "ti sheng yin liang” (increase the volume).
- the phoneme sequence representation 480 may, for example, also trigger another device different from the smart TV to perform a corresponding action.
- the user can also configure the corresponding execution action as "increase the volume of the speakers in the living room".
- the smart TV can send a command to the "living room speaker” to increase the volume of the speaker.
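- The preconfigured rules described above can be pictured as a lookup table from recognized phrases to actions. The following is a minimal sketch under stated assumptions: the `Action` type, the `RULES` table, the device names and the `dispatch` function are all hypothetical and are not part of the disclosure.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Action:
    device: str   # device that should execute the command (hypothetical name)
    command: str  # command to execute (hypothetical name)


# Hypothetical rule table: the custom phoneme sequence and its automatically
# expanded synonym both map to the same configured action.
RULES = {
    "ti gao sheng yin": Action("living_room_speaker", "volume_up"),
    "ti sheng yin liang": Action("living_room_speaker", "volume_up"),
}


def dispatch(recognized: str, local_device: str = "smart_tv") -> Optional[Tuple[str, str, bool]]:
    """Return (target_device, command, is_remote), or None if no rule matches."""
    action = RULES.get(recognized)
    if action is None:
        return None
    # is_remote is True when the command must be sent to another device.
    return action.device, action.command, action.device != local_device
```

- In this sketch the smart TV would forward the command whenever `is_remote` is true, and execute it locally otherwise.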
- a scheme is provided for building a speech recognition model based on keywords.
- a target keyword is acquired, and a synonym group semantically associated with the target keyword is determined.
- the language model is trained with the target keywords and synonyms to obtain the target language model.
- the target language model, acoustic model and pronunciation dictionary are combined to obtain a speech recognition model, which is a decoding graph. In this way, the embodiments of the present disclosure can construct a speech recognition model with keyword generalization recognition capability.
- FIG. 5 shows a flowchart of an example process 500 for speech processing in accordance with some embodiments of the present disclosure.
- Process 500 may be performed by, for example, the first computing device discussed above, such as first computing device 130 in FIG. 1, first computing device 330 in FIG. 3, or first computing device 430 in FIG. 4.
- the process 500 may also be performed collaboratively by the terminal device and the cloud-side device.
- the process 500 is described below by taking the first computing device as an example.
- the first computing device obtains the target keyword.
- target keywords may include user-entered keywords.
- the first computing device may determine the keyword entered by the user from different types of data (e.g., textual data or audio data). The detailed process of determining the keyword input by the user will be described below with reference to FIG. 6 .
- FIG. 6 shows a flowchart of an example process 600 for determining keywords according to an embodiment of the present disclosure.
- the first computing device may obtain target keyword data.
- the first computing device may obtain speech input from an audio collector.
- the first computing device 130 may obtain the speech input 115 from the user terminal 110 .
- the first computing device 130 may obtain speech input 320, for example, from the second computing device 310 on which the speech recognition model is to be deployed.
- the first computing device 430 may be a terminal device, which may directly acquire the voice input 410 using a voice collector (eg, a microphone).
- the first computing device may also obtain text input via a text collector.
- first computing device 130 may obtain text input 120 from user terminal 110 .
- the first computing device 430 may be a terminal device, which may utilize a text collector (eg, a touch screen) to directly obtain the text input 420 .
- the first computing device may determine whether the type of keyword data is audio or text. If the type is text, process 600 may proceed to block 608 where the first computing device may determine the keyword directly from the text input, for example.
- If the type is audio, process 600 may proceed to block 606, where the first computing device may, for example, utilize ASR (Automatic Speech Recognition) to recognize the speech input. Accordingly, the speech input can be converted into corresponding text. Further, at block 608, the first computing device may determine the keyword entered by the user from the text obtained from the speech input.
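- The branching in process 600 can be sketched as follows. This is an illustration only: `determine_keyword` and `fake_asr` are hypothetical names, audio is assumed to arrive as `bytes` and text as `str`, and the whole normalized utterance is treated as the keyword.

```python
from typing import Callable, Union


def determine_keyword(data: Union[str, bytes], asr: Callable[[bytes], str]) -> str:
    """Branch on the input type, then extract the keyword from text."""
    if isinstance(data, bytes):
        # Audio input: run speech recognition first (block 606).
        text = asr(data)
    else:
        # Text input: no recognition step is needed.
        text = data
    # Block 608: in this sketch the normalized text itself is the keyword.
    return text.strip().lower()


def fake_asr(audio: bytes) -> str:
    # Stand-in recognizer for the sketch; a real system would call an ASR engine.
    return "Raise the volume "
```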
- the target keywords may also include pre-stored historical keywords.
- Such historical keywords may include, for example, default system keywords. This system keyword may be specified, for example, by the manufacturer of the second computing device.
- the pre-stored historical keywords may also include user-defined historical keywords, such as the keyword "reduce the sound" shown in FIG. 2B and the like.
- the first computing device may also limit the number of keywords in the keyword group used for training the language model.
- the first computing device may obtain a first keyword group based on the keywords input by the user and the pre-stored historical keywords. If the number of keywords in the first keyword group exceeds a predetermined threshold, the first computing device acquires the target keywords from the first keyword group based on the predetermined threshold. For example, the first computing device may retain only a predetermined threshold number of keywords in the first keyword group as target keywords. In this way, an excessive number of keywords for training the language model can be avoided, which keeps the decoding graph lightweight and therefore applicable to devices with fewer computing resources.
- the first computing device may acquire the target keywords from the first keyword group according to attributes of the keywords in the first keyword group, the number of target keywords being the predetermined threshold.
- attributes may include, for example, keyword type (eg, system created or user-defined) or keyword creation time.
- a predetermined threshold number of keywords may be reserved from the first keyword group according to the creation time of the keywords, so that the keywords with the earliest creation time are deleted.
- where the target keywords include default system keywords, these system keywords may always be used as target keywords.
- the first computing device may also select one or more keywords from the user-defined keywords as the target keyword according to the difference between the predetermined threshold and the number of system keywords.
- the pre-stored historical keywords may only include user-defined keywords. Accordingly, the predetermined threshold can be used to limit the number of user-defined keywords supported by the speech recognition model. In this way, if the first keyword group already includes a predetermined number of user-defined keywords, the first computing device may, for example, select a predetermined threshold number of user-defined keywords from the first keyword group as target keywords.
- the first computing device may also obtain the target keyword from the first keyword group based on the user input. Taking FIG. 2B as an example, the first computing device may, for example, allow the user to configure up to 3 custom keywords. After three custom keywords have been configured, if the user further wishes to add a new custom keyword, the user terminal may ask the user to select which of the three configured custom keywords should be kept or deleted, to ensure that the number of target keywords used for training is the predetermined threshold.
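- The threshold-based selection described above can be sketched as follows, combining two of the described options: system keywords are always kept, and the remaining slots go to the most recently created user-defined keywords, so that the earliest ones are dropped. The `Keyword` type and `select_target_keywords` function are hypothetical names, and creation time is assumed to be an integer that grows with time.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Keyword:
    text: str
    is_system: bool   # True for default system keywords
    created_at: int   # creation time; a larger value means newer (assumption)


def select_target_keywords(group: List[Keyword], threshold: int) -> List[Keyword]:
    """Keep all system keywords, then fill the remaining slots with the newest user keywords."""
    system = [k for k in group if k.is_system]
    user = sorted((k for k in group if not k.is_system),
                  key=lambda k: k.created_at, reverse=True)
    # Slots left for user-defined keywords: threshold minus the number of system keywords.
    slots = max(threshold - len(system), 0)
    return system + user[:slots]
```

- Note that this sketch never drops a system keyword, so the result can exceed the threshold if the system keywords alone outnumber it.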
- embodiments of the present disclosure can support personalized customization of the speech recognition model deployed at the second computing device.
- the first computing device obtains a synonym semantically associated with the target keyword.
- the first computing device may determine a synonym from the thesaurus based on the semantics of the keyword.
- the thesaurus may be maintained locally on the first computing device, or may be maintained at a remote device different from the first computing device.
- the first computing device may directly acquire the previously stored synonym groups without having to re-determine the synonym groups from the thesaurus.
- FIG. 7 shows a flow diagram of an example process 700 of determining synonym groups in accordance with an embodiment of the present disclosure.
- the first computing device may obtain keywords.
- the first computing device may determine the first semantics of the target keyword, eg, utilizing natural language understanding techniques.
- the first computing device may search the thesaurus for a plurality of candidate synonyms that are close to the first semantics based on the first semantics. Specifically, the difference between the determined semantics of each candidate synonym and the first semantics is smaller than a predetermined difference threshold.
- the plurality of candidate synonyms may be directly determined as synonym groups for training the language model.
- the process 700 may also include block 708, where the first computing device may screen the plurality of candidate synonyms. In some embodiments, the first computing device may perform screening based on the length difference between the candidate synonyms and the target keyword, so that the length difference between each synonym and the keyword in the determined synonym group is less than a length threshold.
- the first computing device may determine only the candidate synonyms that have the same length as the target keyword among the plurality of candidate synonyms as the synonym group to be used for training. Based on this manner, the generated decoding map can be made to have a simpler structure, so that it is more suitable for deployment in a second computing device with lower computing capability.
- the first computing device may also provide the plurality of candidate synonyms to the user, and determine the synonym group from the plurality of candidate synonyms based on user input received from the user, the user input indicating that at least one of the plurality of candidate synonyms is excluded or confirmed.
- the first computing device may, for example, provide the user with the multiple candidate synonyms in a suitable manner (eg, voice announcement or display on a screen, etc.), and receive feedback information on the multiple candidate synonyms from the user.
- Such feedback information may, for example, indicate that at least one candidate synonym was confirmed or at least one candidate synonym was excluded.
- the user may determine the synonyms that should be retained or excluded from the multiple candidate synonyms by clicking on the displayed multiple candidate synonyms on the screen.
- the user may also indicate, through voice input, the candidate synonyms that should be retained or excluded from among the plurality of candidate synonyms.
- the embodiments of the present disclosure can adjust the synonyms used for training the speech recognition model based on user feedback, which can make the obtained speech recognition model better match the user's usage habits and avoid automatically expanding synonyms that the user does not expect.
- the first computing device may also make the number of synonyms included in the determined synonym group 135 not exceed a predetermined number. Accordingly, when there are more than a predetermined number of candidate synonyms, the first computing device may, for example, select a predetermined number of candidate synonyms with the closest semantics as the synonym group 135 .
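- The screening steps of process 700 can be combined into one filter, sketched below. Several details are assumptions made for illustration: the function name `screen_synonyms`, a precomputed numeric semantic difference per candidate, and length measured in words.

```python
from typing import Dict, FrozenSet, List, Optional


def screen_synonyms(keyword: str,
                    candidates: Dict[str, float],
                    diff_threshold: float,
                    length_threshold: int,
                    excluded: FrozenSet[str] = frozenset(),
                    max_count: Optional[int] = None) -> List[str]:
    """candidates maps each candidate synonym to its semantic difference from the keyword."""
    keyword_len = len(keyword.split())
    kept = [
        (diff, word) for word, diff in candidates.items()
        if diff < diff_threshold                                     # semantically close enough
        and abs(len(word.split()) - keyword_len) < length_threshold  # length screening (block 708)
        and word not in excluded                                     # removed by user feedback
    ]
    kept.sort()                   # closest semantics first
    if max_count is not None:
        kept = kept[:max_count]   # do not exceed the predetermined number
    return [word for _, word in kept]
```

- With `length_threshold=1`, only candidates of the same length as the keyword survive, which corresponds to the simpler decoding graph mentioned above.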
- the first computing device trains the language model based on the target keyword and the synonym group to obtain the target language model.
- the first computing device may construct a training data set for training the language model based on the target keyword and the synonym group, and obtain the target language model based on the training data set.
- FIG. 8 shows a schematic diagram 800 of an example process for training a language model in accordance with some embodiments of the present disclosure.
- a training data set 805 constructed based on target keywords and synonyms may be provided to a language model training module 810 .
- the language model training module 810 may include a feature extraction module 815 for generating input features according to the training data set 805 and providing the input features to the model training module 820 to obtain a target language model 825. The target language model 825 can indicate grammar constraint rules determined based on the target keywords and synonym groups.
- Examples of the target language model 825 include, but are not limited to: an N-gram model, an RNN-LM model based on a neural network, a JSGF model based on a regular grammar, etc.
- the present disclosure is not intended to limit the specific type of language model.
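- As one illustration of grammar constraint rules, a JSGF grammar can list a keyword and its synonyms as alternatives of a single rule. The sketch below only renders such a grammar as text; the function name `build_jsgf`, the grammar name and the rule names are hypothetical, and the disclosure does not state that its language model is produced this way.

```python
from typing import Dict, List


def build_jsgf(grammar_name: str, keyword_groups: Dict[str, List[str]]) -> str:
    """Render one public rule per keyword; each rule accepts the keyword or any synonym."""
    lines = ["#JSGF V1.0;", f"grammar {grammar_name};"]
    for rule, phrases in keyword_groups.items():
        alternatives = " | ".join(phrases)   # keyword and synonyms as alternatives
        lines.append(f"public <{rule}> = {alternatives};")
    return "\n".join(lines)
```

- For example, `build_jsgf("commands", {"volume_up": ["raise sound", "raise volume"]})` yields a grammar whose only rule accepts either phrase.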
- the first computing device generates a first decoding graph from the target language model, the first decoding graph indicating a plurality of decoding paths that satisfy the grammatical constraint rules determined based on the target keyword and synonyms.
- the target language model, the acoustic model and the pronunciation dictionary are merged to obtain a speech recognition model, wherein the speech recognition model is a decoding graph.
- the first computing device may generate a decoding graph based on the target language model 825 and the existing pronunciation dictionary and acoustic model.
- the acoustic model can be trained offline or online.
- the acoustic model may also adopt various model structures such as DNN-HMM, LSTM-HMM, TDNN-HMM, etc. The present disclosure is not intended to limit the type or training process of the acoustic model.
- the first computing device generates the decoding graph, eg, based on the HCLG (HMM+Context+Lexicon+Grammar) decoding graph construction process.
- FIG. 9 shows a schematic diagram 900 of an example process for generating a decoding map according to an embodiment of the present disclosure.
- the first computing device first uses the model merging unit 915 to merge the target language model 905 (eg, the target language model 825 in FIG. 8 ) and the pronunciation dictionary 910 to generate a merged model 1 920 .
- the first computing device may directly merge the merged model 1 920 with the acoustic model 940 without considering the context-dependent phonemes.
- the first computing device may first use the model merging unit 930 to merge the merged model 1 920 with the context-dependent phonemes.
- the decoding graph 950, also referred to as the HCLG decoding model, is used to indicate multiple decoding paths based on the grammatical constraint rules determined by the target keyword and the synonym group.
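- The first merging step, combining the language model with the pronunciation dictionary, can be illustrated with a toy expansion of word-level paths into phoneme-level paths. This is a simplified sketch and not the HCLG construction itself: practical toolkits represent each component as a weighted finite-state transducer and merge them by composition. The function name, the lexicon entries and the phoneme symbols are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def compose_lexicon_with_grammar(grammar_paths: Iterable[Tuple[str, ...]],
                                 lexicon: Dict[str, List[str]]) -> Dict[Tuple[str, ...], str]:
    """Toy analogue of merging the language model with the pronunciation dictionary.

    Each word sequence allowed by the grammar becomes one phoneme-level decoding
    path that outputs the original word sequence.
    """
    decoding_paths = {}
    for words in grammar_paths:
        phonemes: List[str] = []
        for word in words:
            phonemes.extend(lexicon[word])   # a KeyError signals an out-of-vocabulary word
        decoding_paths[tuple(phonemes)] = " ".join(words)
    return decoding_paths
```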
- the first computing device determines a speech recognition model based on the first decoding map.
- the first computing device may directly use the decoding graph 950 as the final speech recognition model.
- the target keyword may include at least a first keyword and a second keyword, for example.
- the first computing device may also perform synonym clustering on the obtained decoding graph.
- FIG. 10 shows a schematic diagram 1000 of an example process for synonym clustering according to an embodiment of the present disclosure.
- the first computing device may utilize the synonym subgraph clustering module 1020 to perform synonym clustering on the first decoding graph 1010 (e.g., decoding graph 950 in FIG. 9). Specifically, the first computing device may acquire a first group of decoding paths and a second group of decoding paths from the first decoding graph, wherein the first group of decoding paths includes decoding paths corresponding to the first keyword and a first synonym group semantically associated with the first keyword, and the second group of decoding paths includes decoding paths corresponding to the second keyword and a second synonym group semantically associated with the second keyword. Further, the first computing device may generate the first subgraph based on the first group of decoding paths and generate the second subgraph based on the second group of decoding paths.
- FIG. 11 shows a schematic diagram of an example subgraph 1100 in accordance with some embodiments of the present disclosure. As shown in FIG. 11, subgraph 1100 includes decoding paths corresponding to the keyword "raise voice" and synonyms.
- the first computing device may determine a speech recognition model based on at least the first subgraph and the second subgraph. Specifically, the first computing device may, for example, generate a second decoding graph based on the first subgraph and the second subgraph, as the speech recognition model. As shown in FIG. 10, in some embodiments, the decoding graph obtained after subgraph clustering can be directly used as the second decoding graph 1040 and as the final speech recognition model. When the target keyword includes multiple keywords, the generated second decoding graph 1040 may include multiple independent subgraphs corresponding to the multiple keywords.
- the first computing device may further utilize the subgraph weight adjustment module 1030 to perform subgraph weight adjustment on the decoding graph obtained after subgraph clustering. Specifically, the first computing device makes the first decoding path corresponding to the target keyword have the same weight as the second decoding path corresponding to a synonym in the synonym group, to obtain the final decoding graph 1040. Taking FIG. 11 as an example, the decoding path corresponding to the target keyword "raise sound" has the same weight as the decoding path corresponding to the synonym "raise volume". In this way, decoding and searching for the expanded synonyms can be performed faster, thereby reducing computational overhead and storage overhead.
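- Subgraph clustering followed by weight adjustment can be sketched on a flat table of decoding paths. The representation is an assumption for illustration: each path is identified by its phrase and carries one numeric weight, and `cluster_and_equalize` is a hypothetical name. A real decoding graph stores weights on arcs rather than on whole paths.

```python
from collections import defaultdict
from typing import Dict


def cluster_and_equalize(paths: Dict[str, float],
                         synonym_of: Dict[str, str]) -> Dict[str, Dict[str, float]]:
    """paths: {phrase: weight}; synonym_of: {synonym: the target keyword it expands}.

    Every target keyword must itself appear in paths.
    """
    subgraphs: Dict[str, Dict[str, float]] = defaultdict(dict)
    # Clustering: group each path under the keyword it belongs to.
    for phrase, weight in paths.items():
        keyword = synonym_of.get(phrase, phrase)   # a keyword heads its own subgraph
        subgraphs[keyword][phrase] = weight
    # Weight adjustment: synonyms take the weight of the keyword's own path.
    for keyword, members in subgraphs.items():
        shared = members[keyword]
        for phrase in members:
            members[phrase] = shared
    return dict(subgraphs)
```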
- the first computing device can automatically expand the associated synonym groups based on the target keyword and construct a decoding graph for the second computing device.
- the generated decoding graph can meet the requirement of being lightweight, and can also enable the second computing device to generalize when recognizing keywords.
- the first computing device may also provide the speech recognition model to the target computing device (e.g., the second computing device) for deployment on the target computing device.
- the first computing device may send the speech recognition model to the second computing device via wired or wireless communication for deployment to the second computing device.
- the first computing device may also store the model in a predetermined storage device, so that the second computing device can automatically acquire the speech recognition model from the storage device for deployment.
- a scheme is provided for building a speech recognition model based on phoneme sequences.
- target language information is acquired, and a synonymous phoneme sequence group associated with the target language information is determined.
- the language model is trained with the synonymous phoneme sequence group to obtain the target language model.
- the target language model can be combined with the acoustic model to obtain a speech recognition model, which is a decoding graph. In this way, the embodiments of the present disclosure can construct a speech recognition model with the ability to generalize and recognize phoneme sequences.
- FIG. 12 shows a flowchart of an example process 1200 for speech processing in accordance with some embodiments of the present disclosure.
- Process 1200 may be performed by, for example, the first computing device discussed above, such as first computing device 130 in FIG. 1, first computing device 330 in FIG. 3, or first computing device 430 in FIG. 4.
- the process 1200 may also be performed collaboratively by the terminal device and the cloud-side device.
- the process 1200 is described below by taking the first computing device as an example.
- the first computing device obtains target language information.
- the first computing device may determine the target language information from different types of data (eg, textual data or audio data).
- the first computing device may obtain speech input from the audio collector.
- the first computing device 130 may obtain the speech input 115 from the user terminal 110 .
- the first computing device 130 may obtain speech input 320, for example, from the second computing device 310 on which the speech recognition model is to be deployed.
- the first computing device 430 may be a terminal device, which may directly acquire the voice input 410 using a voice collector (eg, a microphone).
- the first computing device may also obtain text input via a text collector.
- first computing device 130 may obtain text input 120 from user terminal 110 .
- the first computing device 430 may be a terminal device, which may utilize a text collector (eg, a touch screen) to directly obtain the text input 420 .
- the target language information may also include pre-stored historical language information.
- Historical linguistic information may include speech or text.
- Such historical language information may include, for example, default system language information. This system language information may be specified by the manufacturer of the second computing device, for example.
- the pre-stored historical language information may also include user-defined historical language information, for example, voice or text corresponding to "reduce the voice" shown in FIG. 2B .
- the first computing device may also limit the number of phoneme sequences in the synonymous phoneme sequence group used to train the language model.
- the first computing device may obtain a first language information group based on the target language information input by the user and the pre-stored historical language information. If the number of pieces of language information in the first language information group exceeds a predetermined threshold, the first computing device acquires the target language information from the first language information group based on the predetermined threshold. For example, the first computing device may retain only a predetermined threshold number of pieces of language information in the first language information group as the target language information. In this way, an excessive amount of target language information for training the language model can be avoided, which keeps the decoding graph lightweight and therefore applicable to devices with fewer computing resources.
- the first computing device may obtain the target language information from the first language information group by attributes of the language information, wherein the amount of the target language information is a predetermined threshold.
- attributes may include, for example, the type of language information (eg, system-created or user-defined) or the creation time of the language information.
- a predetermined threshold number of language information may be retained from the first language information group according to the creation time of the language information, so that the language information with the earliest creation time is deleted.
- the first computing device may also select one or more language information from the user-defined language information as the target language information according to the difference between the predetermined threshold and the number of system language information.
- the pre-stored historical language information may only include user-defined language information. Accordingly, the predetermined threshold can be used to limit the number of user-defined language information supported by the speech recognition model. In this way, if the first language information group already includes a predetermined number of user-defined language information, the first computing device may, for example, select a predetermined threshold number of user-defined language information from the first language information group as the target language information.
- the first computing device may also obtain the target language information from the first language information group based on the user input. Taking FIG. 2B as an example, the first computing device may, for example, allow the user to configure up to 3 custom keywords. After three custom keywords have been configured, if the user further wishes to add a new custom keyword, the user terminal may ask the user to select which of the three configured custom keywords should be kept or deleted.
- the phoneme sequences of the retained keywords and their synonyms can be used to train the target language model. In this way, the amount of language information used for training can be guaranteed to be the predetermined threshold.
- embodiments of the present disclosure can support personalized customization of the speech recognition model deployed at the second computing device.
- the first computing device obtains a set of synonymous phoneme sequences associated with the target language information.
- the synonymous phoneme sequence group includes at least one synonymous phoneme sequence.
- the at least one synonymous phoneme sequence is a phoneme sequence corresponding to words and sentences semantically similar to the target language information.
- the first computing device may determine the synonymous phoneme sequence group based on the semantics of the target language information.
- the target language information in the form of speech can be mapped to keywords through a pronunciation dictionary; synonymous keywords can then be obtained from the thesaurus and mapped to phoneme sequences using the pronunciation dictionary.
- for target language information in the form of text, the corresponding keywords can be directly obtained through word segmentation.
- the thesaurus may be maintained locally on the first computing device, or may be maintained at a remote device different from the first computing device.
- the first computing device may directly acquire the previously stored synonymous phoneme sequence group without re-determining the synonymous phoneme sequence group.
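- The mapping from a keyword to its synonymous phoneme sequence group can be sketched as two lookups: the thesaurus supplies synonymous keywords, and the pronunciation dictionary converts each of them to a phoneme sequence. The function name, the thesaurus contents and the phoneme symbols below are hypothetical, and keywords are assumed to be already segmented into space-separated words.

```python
from typing import Dict, List, Tuple


def synonymous_phoneme_sequences(keyword: str,
                                 thesaurus: Dict[str, List[str]],
                                 pronunciation: Dict[str, List[str]]) -> List[Tuple[str, ...]]:
    """Look up synonyms of the keyword, then map each synonym to a phoneme sequence."""
    group = []
    for synonym in thesaurus.get(keyword, []):
        sequence: List[str] = []
        for word in synonym.split():
            sequence.extend(pronunciation[word])   # word -> phonemes via the dictionary
        group.append(tuple(sequence))
    return group
```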
- Figure 13 shows a flowchart of an example process 1300 of determining groups of synonymous phoneme sequences according to an embodiment of the present disclosure.
- the first computing device may obtain target language information.
- the target language information may be converted into corresponding keywords.
- the first computing device may determine a first semantics of the target language information, eg, utilizing natural language understanding techniques.
- the first computing device may search for a plurality of candidate keywords that are close to the first semantics based on the first semantics, eg, through a thesaurus. Specifically, the difference between the determined semantics of each candidate synonymous keyword and the first semantics is smaller than a predetermined difference threshold.
- the first computing device may obtain phoneme sequences for the plurality of candidate keywords, eg, through a pronunciation dictionary.
- the phoneme sequences of multiple candidate keywords can be directly determined as synonymous phoneme sequence groups for training the language model.
- the process 1300 can also include block 1310, wherein the first computing device can screen the phoneme sequences of the plurality of candidate synonymous keywords.
- the first computing device may perform screening based on the difference in length between the phoneme sequence of the candidate synonymous keyword and the phoneme sequence of the target language information, so that the difference in length between each synonymous phoneme sequence in the determined synonymous phoneme sequence group and the phoneme sequence of the target language information is less than the length threshold.
- the first computing device may determine only a candidate synonymous phoneme sequence having the same length as the target phoneme sequence among the plurality of candidate synonymous phoneme sequences as the synonymous phoneme sequence group to be used for training. Based on this manner, the generated decoding map can be made to have a simpler structure, so that it is more suitable for deployment in a second computing device with lower computing capability.
- the first computing device may also provide a plurality of candidate synonymous phoneme sequences or the corresponding text to the user, and determine the synonymous phoneme sequence group from the plurality of candidate synonymous phoneme sequences based on user input received from the user, the user input indicating that at least one of the plurality of candidate synonymous phoneme sequences is excluded or confirmed.
- the first computing device may, for example, provide the user with the plurality of candidate synonymous phoneme sequences in an appropriate manner (e.g., voice announcement or display on a screen), and receive the user's feedback information on the plurality of candidate synonymous phoneme sequences.
- Such feedback information may, for example, indicate that at least one candidate synonymous phoneme sequence is confirmed or at least one candidate synonymous phoneme sequence is excluded.
- the user may determine a synonymous phoneme sequence that should be retained or excluded from the multiple candidate synonymous phoneme sequences by clicking on the displayed multiple candidate synonymous phoneme sequences or corresponding texts on the screen.
- the user may also indicate, through speech input, the candidate synonymous phoneme sequences that should be retained or excluded from among the plurality of candidate synonymous phoneme sequences.
- the embodiments of the present disclosure can adjust the synonymous phoneme sequences used for training the speech recognition model based on user feedback, which can make the obtained speech recognition model better match the user's usage habits and avoid automatically expanding synonymous phoneme sequences that the user does not expect.
- the first computing device may further make the number of synonymous phoneme sequences included in the determined synonymous phoneme sequence group not exceed a predetermined number.
- the first computing device may, for example, select a predetermined number of candidate synonymous phoneme sequences with the closest semantics as the synonymous phoneme sequence group.
- the first computing device trains the language model with the synonymous phoneme sequence group to obtain the target language model.
- the first computing device may construct a training data set for training the language model based on the synonymous phoneme sequence group, and obtain the target language model based on the training data set.
- the example process of training the language model is similar to the process described with reference to FIG. 8 and will not be repeated here.
- the target language model can indicate grammatical constraint rules determined based on target keywords and synonyms.
- Examples of the target language model include, but are not limited to: an N-gram model, an RNN-LM model based on a neural network, a JSGF model based on a regular grammar, etc.
- the present disclosure does not intend to limit the specific type of language model.
- the first computing device generates a first decoding map from the target language model, the first decoding map indicating a plurality of decoding paths that satisfy the syntactic constraint rules determined based on the synonymous phoneme sequence group.
- the first computing device may generate a decoding map based on the target language model and the existing acoustic model.
- the acoustic model can be trained offline or online.
- the acoustic model may also adopt various model structures such as DNN-HMM, LSTM-HMM, TDNN-HMM, etc. The present disclosure is not intended to limit the type or training process of the acoustic model.
- FIG. 14 shows a schematic diagram of an example process 1400 of generating a decoding graph according to an embodiment of the present disclosure. Compared to the process shown in FIG. 9, the process 1400 does not need a pronunciation dictionary, and the language model 1420 is trained based on phoneme sequences.
- the first computing device may directly merge the language model 1420 with the acoustic model 1440 without considering the context-dependent phonemes.
- the first computing device may first use the model merging unit 1430 to merge the language model 1420 and the context-dependent phonemes 1425 into a merged model 1435, and then use a model merging unit 1445 to merge the merged model 1435 and the acoustic model 1440 into a decoding graph 1450.
- the decoding graph 1450 is used to indicate multiple decoding paths based on the grammar constraint rules determined based on the target phoneme sequence and the synonymous phoneme sequence group.
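To make the notion of "multiple decoding paths" concrete, the sketch below stores the permitted phoneme sequences in a prefix tree and enumerates the paths through it. This is only a toy stand-in: it does not perform the merging with context-dependent phonemes or the acoustic model described above, and the names `build_graph`, `decoding_paths`, and the `<end>` marker are hypothetical.

```python
from typing import Dict, List, Tuple


def build_graph(sequences: List[str]) -> Dict:
    """Insert each phoneme sequence into a nested-dict prefix tree."""
    root: Dict = {}
    for sequence in sequences:
        node = root
        for unit in sequence.split():
            node = node.setdefault(unit, {})
        node["<end>"] = {}  # marks a complete, accepted path
    return root


def decoding_paths(node: Dict, prefix: Tuple[str, ...] = ()) -> List[str]:
    """Enumerate every complete path allowed by the graph."""
    paths = []
    for unit, child in node.items():
        if unit == "<end>":
            paths.append(" ".join(prefix))
        else:
            paths.extend(decoding_paths(child, prefix + (unit,)))
    return paths


graph = build_graph(["ti gao sheng yin", "ti sheng yin liang"])
paths = decoding_paths(graph)
```

Both sequences share the initial unit "ti", so the graph branches after the first node; any sequence not inserted has no complete path and would be rejected.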
- the first computing device determines a speech recognition model based on the first decoding map.
- the first computing device may directly use the decoding graph 1450 as the final speech recognition model.
- the target phoneme sequence may, for example, include at least a first phoneme sequence and a second phoneme sequence.
- the first computing device may also perform synonymous phoneme sequence clustering on the obtained decoding graph.
- FIG. 15 shows a schematic diagram 1500 of an example process for synonymous phoneme sequence clustering according to an embodiment of the present disclosure.
- the first computing device may utilize the synonymous phoneme sequence subgraph clustering module 1520 to perform synonymous phoneme sequence clustering on the first decoding map 1510 (e.g., decoding graph 1450 in FIG. 14). Specifically, the first computing device may acquire a first group of decoding paths and a second group of decoding paths from the first decoding map, wherein the first group of decoding paths includes decoding paths corresponding to the first phoneme sequence and to a first synonymous phoneme sequence group semantically associated with the first phoneme sequence, and the second group of decoding paths includes decoding paths corresponding to the second phoneme sequence and to a second synonymous phoneme sequence group semantically associated with the second phoneme sequence. The first computing device may then generate a first subgraph based on the first group of decoding paths and a second subgraph based on the second group of decoding paths.
- FIG. 16 shows a schematic diagram of an example subgraph 1600 in accordance with some embodiments of the present disclosure.
- subgraph 1600 includes decoding paths corresponding to the phoneme sequence "ti gao sheng yin" (to raise the voice) and a group of synonymous phoneme sequences.
- the first computing device may determine the speech recognition model based on at least the first subgraph and the second subgraph. Specifically, the first computing device may, for example, generate a second decoding map based on the first subgraph and the second subgraph, as the speech recognition model. As shown in FIG. 15, in some embodiments, the decoding graph obtained after subgraph clustering can be directly used as the second decoding map 1540 and as the final speech recognition model. When the target phoneme sequence includes multiple phoneme sequences, the generated second decoding map 1540 may include multiple independent subgraphs corresponding to the multiple phoneme sequences.
- the first computing device may further utilize the subgraph weight adjustment module 1530 to adjust subgraph weights in the decoding graph obtained after subgraph clustering.
- the first computing device makes the first decoding path corresponding to the target phoneme sequence have the same weight as the second decoding path corresponding to the synonymous phoneme sequence in the synonymous phoneme sequence group to obtain the final decoding map 1540.
- the decoding path corresponding to the target phoneme sequence "ti gao sheng yin" has the same weight as the decoding path corresponding to the synonym "ti sheng yin liang" (boost the volume). Based on this method, faster decoding and searching for the expanded synonymous phoneme sequence can be implemented, thereby reducing computational overhead and storage overhead.
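The clustering and weight adjustment described above can be sketched as follows. The sketch assumes decoding paths are plain strings and that "the same weight" is realized as a uniform weight within each subgraph; the function name `cluster_and_equalize`, the second target "da kai dian shi" with its synonym "kai dian shi", and the uniform-probability choice are assumptions made for illustration.

```python
from typing import Dict, List


def cluster_and_equalize(synonym_groups: Dict[str, List[str]]) -> Dict[str, Dict[str, float]]:
    """Build one subgraph per target sequence and give all of its paths equal weight."""
    subgraphs = {}
    for target, synonyms in synonym_groups.items():
        # The subgraph holds the target path plus every distinct synonymous path.
        paths = [target] + [s for s in synonyms if s != target]
        weight = 1.0 / len(paths)
        subgraphs[target] = {path: weight for path in paths}
    return subgraphs


subgraphs = cluster_and_equalize({
    "ti gao sheng yin": ["ti sheng yin liang"],
    "da kai dian shi": ["kai dian shi"],  # hypothetical synonym
})
```

Each target phoneme sequence ends up with its own independent subgraph, and within a subgraph the target path and its synonymous paths are weighted identically.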
- the first computing device can automatically expand the associated set of synonymous phoneme sequences based on the target phoneme sequence and construct a decoding graph for the second computing device.
- the generated decoding map can not only meet the requirement of light weight, but also enable the second computing device to have the capability of generalization and recognition of phoneme sequences.
- the first computing device may also provide the speech recognition model to the target computing device (e.g., the second computing device) for deployment on the target computing device.
- the first computing device may send the speech recognition model to the second computing device via wired or wireless communication for deployment to the second computing device.
- the first computing device may also store the model in a predetermined storage device, so that the second computing device can automatically acquire the speech recognition model from the storage device for deployment.
- speech input is received, and a speech recognition model is used to determine a textual representation associated with the speech input, wherein the speech recognition model is obtained based on the following process: obtaining target keywords; obtaining synonyms semantically associated with the target keywords; training a language model using the target keywords and the synonyms to obtain a target language model; and merging the target language model, an acoustic model, and a pronunciation dictionary to obtain the speech recognition model.
- the speech recognition model is a decoding graph. In this manner, the embodiments of the present disclosure can enable, for example, a computing device with less computing power to generalize and recognize keywords, thereby improving the user's voice interaction experience.
- Process 1700 may be performed by, for example, the second computing device discussed above, such as second computing device 150 in FIG. 1, second computing device 310 in FIG. 3, or second computing device 450 in FIG. 4.
- the second computing device receives speech input.
- the second computing device may receive speech input via an audio collector (eg, a microphone) local to the second computing device or an audio collector communicatively coupled to the second computing device.
- second computing device 150 in FIG. 1 may receive voice input 160 from user 155, second computing device 310 in FIG. 3 may receive voice input 360 from user 355, and second computing device 450 may receive speech input 460 from user 455.
- the second computing device utilizes a speech recognition model to determine a textual representation associated with the speech input.
- the speech recognition model is obtained by the first computing device using the keyword training data based on the process discussed above.
- for the specific construction process of the speech recognition model, refer to the content described above with respect to FIG. 5 to FIG. 11, which is not described in detail again here.
- FIG. 18 further illustrates a flowchart of an example process for determining speech recognition results in accordance with embodiments of the present disclosure. As shown in FIG. 18, at block 1802, the second computing device may acquire the speech signal, and at block 1804 it may preprocess the signal.
- the second computing device may perform framing and windowing on the preprocessed signal; at block 1808, the second computing device may extract features; at block 1810, the second computing device may perform a decoding search based on the extracted features using the deployed decoding graph; at block 1812, the second computing device may utilize the decoding graph to obtain a recognition result, i.e., a textual representation or a phoneme sequence representation associated with the speech input.
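The front-end steps of FIG. 18 (preprocessing, framing and windowing, feature extraction) can be sketched in plain Python. The disclosure does not name specific techniques, so the pre-emphasis filter, the Hamming window, and the log-energy feature below are common choices assumed for illustration; all function names and parameter values are hypothetical, and the decoding search itself is not shown.

```python
import math
from typing import List


def pre_emphasis(signal: List[float], coeff: float = 0.97) -> List[float]:
    """Assumed preprocessing step: boost high frequencies with a first-order filter."""
    return [signal[0]] + [signal[i] - coeff * signal[i - 1] for i in range(1, len(signal))]


def frame_and_window(signal: List[float], frame_len: int, hop: int) -> List[List[float]]:
    """Split the signal into overlapping frames and apply a Hamming window to each."""
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1)) for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frames.append([signal[start + n] * window[n] for n in range(frame_len)])
    return frames


def log_energy(frame: List[float]) -> float:
    """Assumed per-frame feature: log of the frame energy (offset avoids log(0))."""
    return math.log(sum(x * x for x in frame) + 1e-10)


# A short synthetic signal stands in for the acquired speech signal.
signal = [math.sin(2 * math.pi * 3 * n / 16) for n in range(16)]
frames = frame_and_window(pre_emphasis(signal), frame_len=8, hop=4)
features = [log_energy(frame) for frame in frames]
```

In a real system the per-frame features would be richer (for example filterbank or cepstral features) and would be passed to the decoding search over the deployed decoding graph.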
- the textual representation may correspond to a target keyword or a synonym in a group of synonyms.
- the second computing device may also perform an action corresponding to the textual representation.
- the second computing device may query a predetermined action rule according to the determined textual representation to determine a corresponding action that the second computing device should perform.
- the second computing device may also generate a corresponding control command based on the textual representation, and send it to the third computing device, so that the third computing device performs the corresponding action.
- the second computing device 150 may be a smart speaker, and when the text indicates "turn on the TV", the smart speaker may send a power-on instruction to the corresponding smart TV, so that the smart TV is automatically turned on.
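A predetermined action rule of this kind can be sketched as a lookup table from textual representation to a target device and command. The rule table, the device identifiers `local` and `smart_tv`, the command names, and the function `dispatch` are all hypothetical; only the "turn on the TV" example is taken from the text above.

```python
from typing import Dict, Optional, Tuple

# Hypothetical action rules: textual representation -> (device, command).
ACTION_RULES: Dict[str, Tuple[str, str]] = {
    "increase volume": ("local", "volume_up"),
    "raise the voice": ("local", "volume_up"),  # synonym mapped to the same action
    "turn on the TV": ("smart_tv", "power_on"),
}


def dispatch(text: str) -> Optional[Dict[str, str]]:
    """Return the control command for a recognized text, or None if no rule matches."""
    rule = ACTION_RULES.get(text)
    if rule is None:
        return None
    device, command = rule
    return {"device": device, "command": command}


local_action = dispatch("increase volume")
remote_command = dispatch("turn on the TV")
```

A command whose device is `local` would be executed by the second computing device itself, while any other device identifier would be sent on to the corresponding third computing device.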
- the second computing device may process the speech input to detect the keywords included therein using a speech recognition model with keyword generalization recognition capability.
- the speech recognition model may also be obtained using phoneme sequence training data.
- for the specific construction process of the speech recognition model, reference may be made to the content described above with respect to FIG. 12 to FIG. 16, which will not be described in detail here.
- the second computing device may utilize a speech recognition model to determine a phonemic sequence representation associated with the speech input.
- the phoneme sequence representation may correspond to a target phoneme sequence or a synonymous phoneme sequence in a group of synonymous phoneme sequences.
- the second computing device may also perform an action corresponding to the phoneme sequence representation.
- the second computing device may query a predetermined action rule according to the determined phoneme sequence representation to determine a corresponding action that the second computing device should perform. For example, according to the phoneme sequence denoted as "ti sheng yin liang" (increase volume), a second computing device (eg, a smart speaker) may perform an action of raising the volume of the speaker.
- the second computing device may also generate a corresponding control command based on the phoneme sequence representation, and send it to the third computing device, so that the third computing device performs the corresponding action.
- the second computing device 150 may be a smart speaker, and when the phoneme sequence is represented as "da kai dian shi" (turn on the TV), the smart speaker may send a power-on instruction to the corresponding smart TV, so that the smart TV is automatically turned on.
- the second computing device may process the speech input to detect the phoneme sequences included therein using a speech recognition model having a generalized recognition capability of phoneme sequences.
- FIG. 19 further illustrates a flowchart of an example process 1900 of a speech processing method according to an embodiment of the present disclosure.
- Process 1900 may be performed by, for example, the second computing device discussed above, such as second computing device 150 in FIG. 1, second computing device 310 in FIG. 3, or second computing device 450 in FIG. 4.
- the second computing device receives voice instruction input.
- the second computing device may receive speech input via an audio collector (eg, a microphone) local to the second computing device or an audio collector communicatively coupled to the second computing device.
- the second computing device utilizes the speech recognition model to obtain a phoneme sequence representation of the speech input.
- the speech recognition model is configured to identify semantically associated groups of phoneme sequences.
- the speech recognition model is obtained based on the following process: obtaining target language information; obtaining a synonymous phoneme sequence group associated with the target language information, the synonymous phoneme sequence group comprising at least one synonymous phoneme sequence, the at least one synonymous phoneme sequence being a phoneme sequence corresponding to words and sentences semantically similar to the target language information; training a language model using the synonymous phoneme sequence group to obtain a target language model; generating a first decoding map according to the target language model, the first decoding map indicating a plurality of decoding paths satisfying the grammar constraint rules determined based on the synonymous phoneme sequence group; and determining the speech recognition model based on the first decoding map.
- the second computing device may provide a notification of no recognition result if the phoneme sequence representation does not match any one phoneme sequence in the set of phoneme sequences.
- a no-recognition result can be represented as a garbage representation such as "SIL" and discarded.
- the speech recognition model is configured to identify a first group of phoneme sequences having a first associated semantics and a second group of phoneme sequences having a second associated semantics.
- Process 1900 may also include: if the phoneme sequence representation corresponds to a first phoneme sequence in the first phoneme sequence group, performing a first action; and if the phoneme sequence representation corresponds to a second phoneme sequence in the second phoneme sequence group, performing a second action that is different from the first action.
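The group-based dispatch and the no-match case can be sketched together. The sketch assumes the phoneme sequence representation is a plain string and that a no-recognition result is returned as the garbage label "SIL"; the action names `volume_up` and `tv_on`, the group contents, and the function `handle` are hypothetical.

```python
from typing import Dict, Set

# Hypothetical groups: action -> semantically associated phoneme sequences.
GROUPS: Dict[str, Set[str]] = {
    "volume_up": {"ti gao sheng yin", "ti sheng yin liang"},
    "tv_on": {"da kai dian shi"},
}


def handle(phoneme_repr: str, groups: Dict[str, Set[str]]) -> str:
    """Return the action for the matching group, or "SIL" when nothing matches."""
    for action, sequences in groups.items():
        if phoneme_repr in sequences:
            return action
    return "SIL"


first_action = handle("ti sheng yin liang", GROUPS)
second_action = handle("da kai dian shi", GROUPS)
no_result = handle("ni hao", GROUPS)
```

Every sequence in a group triggers the same action, different groups trigger different actions, and an unmatched input yields the garbage label so that it can be discarded.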
- utilizing a speech recognition model to obtain a phoneme sequence representation of the speech input may include: utilizing an acoustic model to generate emission probabilities of speech features of the speech input to phonemes; identifying the speech input; and causing the speech recognition model to output a phoneme sequence representation.
- FIG. 20 shows a schematic block diagram of an example speech recognition system 2000 in accordance with some specific embodiments of the present disclosure.
- the speech recognition system 2000 may include a cloud-side or embedded heavy device 2020, examples of which include, but are not limited to, cloud-side servers, smartphones, laptops, tablet computers, desktops, or edge computing devices.
- the cloud-side or embedded heavy device 2020 can obtain the keyword input data.
- the keyword input data may be acquired by, for example, a custom keyword input module 2015 deployed in the keyword receiving device 2010 .
- the keyword receiving device 2010 may be, for example, a device different from the cloud-side or embedded heavy device 2020, and may transmit the keyword input data via wired or wireless communication to the transmission communication module 2050 in the cloud-side or embedded heavy device 2020.
- the keyword receiving device 2010 may also be, for example, an embedded light device 2055 for deploying a speech recognition model.
- the keyword receiving device 2010 may also be the same device as the cloud-side or embedded heavy device 2020, in which case the output of the custom keyword input module 2015 may be directly provided to the data preprocessing module 2025.
- the data preprocessing module 2025 can determine the custom keyword based on the received keyword input data. For example, when the keyword input data is text data, the data preprocessing module 2025 may directly determine the custom keyword based on the text input. In addition, when the keyword input data is audio data, the data preprocessing module 2025 may first convert the audio data into text data using automatic speech recognition technology, and further determine the custom keyword from the text data.
- the data preprocessing module 2025 may determine target keywords based on the custom keywords and pre-stored historical keywords.
- the synonym augmentation module 2030 may determine the synonym group associated with the target keyword from the thesaurus based on the semantics of the target keyword.
- the model training module 2035 may train a language model based on the target keywords and synonyms, and store the language model in the model library 2040.
- the model library 2040 may maintain, for example, already trained acoustic models, language models and pronunciation dictionaries. In some embodiments, the model library 2040 may also be maintained on a cloud server, for example.
- the trained language model may also be provided to the decoding graph building module 2045 to generate a decoding graph for the embedded light device 2055 based on the language model, the acoustic model stored in the model library 2040, and the pronunciation dictionary.
- the generated decoding map can be sent to the keyword recognition and detection unit 2060 in the embedded light device 2055 via the transmission communication module 2050, so that the keyword recognition and detection unit 2060 can process the received speech input using the decoding map to determine a textual representation corresponding to the speech input.
- An example speech recognition system 2000 for speech recognition based on keywords is described above.
- the present disclosure also provides an exemplary speech recognition system for speech recognition based on phoneme sequences, in which the speech recognition model is constructed using phoneme sequences rather than keywords in text form, and the constructed speech recognition model recognizes speech input as phoneme sequences, so that speech is recognized based on phoneme sequences rather than keywords.
- the overall structure of this speech recognition system is similar to that of the speech recognition system 2000 in FIG. 20, and details are not repeated here.
- the speech model construction system 2100 includes: a keyword acquisition unit 2110 for acquiring target keywords; a synonym acquisition unit 2120 for acquiring synonym groups semantically associated with the target keywords; a model training unit 2130 for training the language model using the target keyword and the synonym group to obtain the target language model; a decoding map generation unit 2140 for generating the first decoding map according to the target language model, the first decoding map indicating multiple decoding paths that satisfy the grammar constraint rules determined based on the target keyword and the synonym group; and a model determining unit 2150 for determining a speech recognition model based on the first decoding map.
- the target keywords include keywords of speech input from an audio collector located at the user terminal.
- the target keywords include keywords of text input from a text collector located at the user terminal.
- the synonym obtaining unit 2120 is further configured to: determine the semantics of the target keyword; and determine a synonym group based on at least the semantics of the target keyword, wherein the difference between the semantics of each synonym in the synonym group and the semantics of the target keyword is less than a difference threshold.
- the synonym obtaining unit 2120 is further configured to: determine a synonym group based on the semantics of the target keyword and the length of the target keyword, wherein the difference between the length of each synonym in the synonym group and the length of the target keyword is less than a length threshold.
- the length of a keyword may represent, for example, the number of characters or the number of words included in the keyword.
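The two thresholds can be sketched as a filter over candidate synonyms. The sketch assumes semantics are available as embedding vectors compared by cosine distance and that keyword length is counted in words; the vectors, the threshold values, the candidate phrases, and the function names are hypothetical, and the disclosure does not prescribe how the semantic difference is computed.

```python
import math
from typing import Dict, List


def cosine_distance(a: List[float], b: List[float]) -> float:
    """1 minus cosine similarity; smaller means semantically closer (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def filter_synonyms(target: str, target_vec: List[float],
                    candidates: Dict[str, List[float]],
                    diff_threshold: float, length_threshold: int) -> List[str]:
    """Keep candidates whose semantic and length differences are both below threshold."""
    kept = []
    for word, vec in candidates.items():
        if cosine_distance(target_vec, vec) >= diff_threshold:
            continue  # semantics too far from the target keyword
        if abs(len(word.split()) - len(target.split())) >= length_threshold:
            continue  # length differs too much from the target keyword
        kept.append(word)
    return kept


# Toy two-dimensional embeddings, invented for this example.
candidates = {
    "boost the volume": [0.9, 0.1],
    "make it a great deal louder now": [0.95, 0.05],
    "turn off": [0.0, 1.0],
}
synonym_group = filter_synonyms("increase the volume", [1.0, 0.0], candidates,
                                diff_threshold=0.2, length_threshold=2)
```

Here the long paraphrase is rejected by the length threshold and "turn off" by the semantic threshold, leaving a single synonym in the group.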
- the synonym obtaining unit 2120 is further configured to: obtain multiple candidate synonyms based on the semantics of the target keyword; provide the multiple candidate synonyms to the user; and determine, based on user input received from the user, a synonym group from the multiple candidate synonyms, the user input indicating that at least one candidate synonym of the multiple candidate synonyms is excluded or confirmed.
- the target keyword includes at least a first keyword and a second keyword, and the model determining unit 2150 is further configured to: obtain a first group of decoding paths and a second group of decoding paths from the first decoding map, wherein the first group of decoding paths includes decoding paths corresponding to the first keyword and to a first synonym group semantically associated with the first keyword, and the second group of decoding paths includes decoding paths corresponding to the second keyword and to a second synonym group semantically associated with the second keyword; generate a first subgraph based on the first group of decoding paths; generate a second subgraph based on the second group of decoding paths; and determine the speech recognition model based on at least the first subgraph and the second subgraph.
- the first subgraph indicates a first decoding path and second decoding paths, wherein the first decoding path is the decoding path corresponding to the first keyword, each second decoding path is a decoding path corresponding to a synonym in the first synonym group, and the first decoding path and each second decoding path have the same weight in the first subgraph.
- the keyword obtaining unit 2110 is further configured to: obtain the first keyword group according to the pre-stored historical keywords and the received keywords; and in response to determining that the number of keywords in the first keyword group exceeds a predetermined threshold, At least one keyword in the first keyword group is deleted, and the remaining keywords in the first keyword group after deleting the at least one keyword are target keywords.
- the keyword obtaining unit 2110 is further configured to: delete at least one keyword in the first keyword group according to the attributes of the keywords in the first keyword group.
- the keyword obtaining unit 2110 is further configured to: delete at least one keyword in the first keyword group according to the user's instruction.
- the speech model building system 2100 may further include a communication unit for providing the speech recognition model to the second computing device for deploying the speech recognition model on the second computing device.
- each unit in the speech model building system 2100 may be implemented using hardware units, software units, or a combination of hardware units and software units.
- the speech processing system 2200 includes a speech input unit 2210 for receiving speech input; and a speech processing unit 2220 for utilizing a speech recognition model to determine a textual representation associated with the speech input.
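- the speech recognition model is obtained based on the following process: obtaining the target keyword; obtaining a synonym group semantically associated with the target keyword; training a language model using the target keyword and the synonym group to obtain a target language model; generating a first decoding map according to the target language model, the first decoding map indicating multiple decoding paths that satisfy the grammatical constraint rules determined based on the target keyword and the synonym group; and determining the speech recognition model based on the first decoding map.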
- the target keywords include keywords of speech input from an audio collector located at the user terminal.
- the target keywords include keywords of text input from a text collector located at the user terminal.
- determining the synonym group semantically associated with the target keyword comprises: determining the semantics of the target keyword; and determining the synonym group based on at least the semantics of the target keyword, wherein the difference between the semantics of each synonym in the synonym group and the semantics of the target keyword is less than a difference threshold.
- determining the synonym group based on at least the semantics of the target keyword includes: determining the synonym group based on the semantics of the target keyword and the length of the target keyword, wherein the difference between the length of each synonym in the synonym group and the length of the target keyword is less than a length threshold.
- the length of a keyword may represent, for example, the number of characters or the number of words included in the keyword.
- determining the synonym group based at least on the semantics of the target keyword includes: obtaining a plurality of candidate synonyms based on the semantics of the target keyword; providing the plurality of candidate synonyms to the user; and determining, based on user input received from the user, the synonym group from the plurality of candidate synonyms, the user input indicating that at least one candidate synonym of the plurality of candidate synonyms is excluded or confirmed.
- the target keyword includes at least a first keyword and a second keyword, and determining the speech recognition model based on the first decoding map includes: obtaining a first group of decoding paths and a second group of decoding paths from the first decoding map, wherein the first group of decoding paths includes decoding paths corresponding to the first keyword and to a first synonym group semantically associated with the first keyword, and the second group of decoding paths includes decoding paths corresponding to the second keyword and to a second synonym group semantically associated with the second keyword; generating a first subgraph based on the first group of decoding paths; generating a second subgraph based on the second group of decoding paths; and determining the speech recognition model based on at least the first subgraph and the second subgraph.
- the first subgraph indicates a first decoding path and second decoding paths, wherein the first decoding path is the decoding path corresponding to the first keyword, each second decoding path is a decoding path corresponding to a synonym in the first synonym group, and the first decoding path and each second decoding path have the same weight in the first subgraph.
- acquiring the target keyword includes: acquiring a first keyword group according to pre-stored historical keywords and received keywords; and in response to determining that the number of keywords in the first keyword group exceeds a predetermined threshold, deleting the first keyword group At least one keyword in the keyword group, and the remaining keywords in the first keyword group after deleting the at least one keyword are the target keywords.
- deleting at least one keyword in the first keyword group includes: deleting at least one keyword in the first keyword group according to attributes of the keywords in the first keyword group. For example, the earliest-created historical keyword may be deleted according to the creation times of the historical keywords.
- deleting at least one keyword in the first keyword group includes: deleting at least one keyword in the first keyword group according to a user instruction.
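The keyword-group pruning described above can be sketched as follows, using creation time as the keyword attribute. The sketch assumes each keyword is stored with a numeric creation timestamp and that the oldest keywords are dropped first once the predetermined threshold is exceeded; the function name `merge_and_prune` and the example timestamps are hypothetical.

```python
from typing import Dict, List


def merge_and_prune(history: Dict[str, float], received: Dict[str, float],
                    max_keywords: int) -> List[str]:
    """Merge historical and received keywords, then drop the oldest beyond the limit."""
    group = {**history, **received}  # first keyword group: keyword -> creation time
    if len(group) > max_keywords:
        newest_first = sorted(group, key=lambda keyword: group[keyword], reverse=True)
        kept = set(newest_first[:max_keywords])
        return [keyword for keyword in group if keyword in kept]
    return list(group)


history = {"stop": 1.0, "pause": 2.0}
received = {"increase volume": 3.0}
target_keywords = merge_and_prune(history, received, max_keywords=2)
```

With a threshold of two keywords, the earliest-created keyword "stop" is deleted and the remaining keywords become the target keywords. Deletion according to a user instruction could replace the sort with an explicit set of keywords chosen by the user.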
- the speech processing system 2200 may further include an action execution unit for executing an action corresponding to the text representation.
- the speech processing system 2200 may further include a device control unit for generating corresponding control commands based on the textual representation and sending them to the third computing device to cause the third computing device to perform corresponding actions.
- the textual representation corresponds to a target keyword or a synonym in a synonym group.
- each unit in the speech processing system 2200 may be implemented using hardware units, software units, or a combination of hardware units and software units.
- An example of the voice input unit 2210 may include a microphone for receiving voice input, and an example of the voice processing unit 2220 may include a processing device for performing voice recognition operations.
- speech model building system 2100 and/or speech processing system 2200 may be implemented using application-specific integrated circuits, one or more FPGAs (field-programmable gate arrays), PLDs (programmable logic devices), controllers, state machines, gate logic, discrete hardware components, any other suitable circuits, or any combination of circuits capable of performing the various processes of the present disclosure, or using a chip, a single board, a communication device, or the like.
- the speech model construction system 2300 includes: a target language information acquisition unit 2310 for acquiring target language information; a synonymous phoneme sequence group acquisition unit 2320 for acquiring a synonymous phoneme sequence group associated with the target language information, the synonymous phoneme sequence group including at least one synonymous phoneme sequence, the at least one synonymous phoneme sequence being a phoneme sequence corresponding to words and sentences semantically similar to the target language information; a model training unit 2330 for training a language model using the synonymous phoneme sequence group to obtain a target language model; a decoding map generation unit 2340 for generating a first decoding map according to the target language model, the first decoding map indicating a plurality of decoding paths that satisfy the grammatical constraint rules determined based on the synonymous phoneme sequence group; and a model determining unit 2350 for determining the speech recognition model based on the first decoding map.
- the target language information may include speech or text.
- the target language information includes speech input from an audio collector located at the user terminal.
- the target language information includes text input from a text collector located at the user terminal.
- the target language information may be some short instruction words or instruction sentences, such as "turn off", "stop", "pause", "raise the volume", "increase the volume" and so on.
- the user may, for example, directly provide voice input or text input to the user terminal, so that the user terminal can extract target language information from the voice input or text input in order to construct the speech recognition model.
- the user can input voice or text by using an interface provided by the user terminal, for example.
- Such speech input or text input may be sent to the first computing device so that the first computing device can obtain target language information for the construction of the speech recognition model.
- the user can customize the phoneme sequence that the speech recognition model can support, so that the degree of personalization of the speech recognition model can be improved.
- the synonymous phoneme sequence group obtaining unit 2320 may be further configured to: determine the semantics of the target language information; and determine a synonymous phoneme sequence group based on at least the semantics of the target language information, wherein the difference between the semantics of each synonymous phoneme sequence in the synonymous phoneme sequence group and the semantics of the target language information is less than a difference threshold.
- the synonymous phoneme sequence group obtaining unit 2320 may also be used to: determine a target phoneme sequence corresponding to the target language information; and determine a synonymous phoneme sequence group based on the semantics of the target phoneme sequence and the length of the target phoneme sequence, wherein the difference between the length of each synonymous phoneme sequence in the synonymous phoneme sequence group and the length of the target phoneme sequence is less than a length threshold.
- the length of a phoneme sequence may represent, for example, the number of phonemes (eg, initials and finals) included in the phoneme sequence.
- if the target language information is text, a phoneme sequence corresponding to the text can be obtained as the target phoneme sequence through a pronunciation dictionary.
- if the target language information is speech, the phoneme sequence of the speech can be obtained as the target phoneme sequence through an acoustic model.
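The two thresholds described above (semantic difference and phoneme-sequence length) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the `Candidate` record, the precomputed `semantic_distance` score, the pinyin-style phoneme segmentation, and the threshold values are not taken from the disclosure, which does not say how semantic difference is measured.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record for one candidate synonymous phrase."""
    text: str
    phonemes: tuple           # phoneme sequence, e.g. initials and finals
    semantic_distance: float  # assumed precomputed distance to the target's semantics


def select_synonymous_sequences(target_phonemes, candidates,
                                difference_threshold, length_threshold):
    """Keep candidates that are close to the target in meaning and in length."""
    selected = []
    for candidate in candidates:
        close_in_meaning = candidate.semantic_distance < difference_threshold
        length_gap = abs(len(candidate.phonemes) - len(target_phonemes))
        if close_in_meaning and length_gap < length_threshold:
            selected.append(candidate.phonemes)
    return selected


# Illustrative data only: 8 phonemes for the target phrase.
target_phonemes = ("t", "i", "g", "ao", "y", "in", "l", "iang")
candidates = [
    Candidate("tiao gao yin liang",
              ("t", "iao", "g", "ao", "y", "in", "l", "iang"), 0.10),
    Candidate("sheng yin da yi dian",
              ("sh", "eng", "y", "in", "d", "a", "y", "i", "d", "ian"), 0.15),
    Candidate("guan ji", ("g", "uan", "j", "i"), 0.90),
]
group = select_synonymous_sequences(target_phonemes, candidates,
                                    difference_threshold=0.3,
                                    length_threshold=2)
```

With these illustrative values, only the first candidate survives: the second is close in meaning but two phonemes longer than the target, and the third fails the semantic threshold.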
- the synonymous phoneme sequence group obtaining unit 2320 may also be configured to: obtain multiple candidate synonyms based on the semantics of the target keyword corresponding to the target language information; provide the multiple candidate synonyms to the user; determine a synonym group from the plurality of candidate synonyms based on user input received from the user, the user input indicating that at least one candidate synonym in the plurality of candidate synonyms is excluded or confirmed; and obtain a synonymous phoneme sequence group based on a pronunciation dictionary and the synonym group.
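As a rough sketch of the user-confirmation step above, the following Python snippet filters candidate synonyms by user input and then looks each remaining synonym up in a pronunciation dictionary. The function names, the phrase-keyed dictionary layout, and the sample entries are hypothetical; the disclosure does not specify these details.

```python
def build_synonym_group(candidates, excluded=(), confirmed=None):
    """Apply user input: keep confirmed candidates (if any were given),
    then drop every candidate the user excluded."""
    excluded_set = set(excluded)
    confirmed_set = None if confirmed is None else set(confirmed)
    group = []
    for candidate in candidates:
        if confirmed_set is not None and candidate not in confirmed_set:
            continue
        if candidate in excluded_set:
            continue
        group.append(candidate)
    return group


def to_phoneme_sequences(synonyms, pronunciation_dict):
    """Map each synonym to its phoneme sequence; skip entries the dictionary lacks."""
    return [tuple(pronunciation_dict[s]) for s in synonyms if s in pronunciation_dict]


# Illustrative data only.
candidate_synonyms = ["raise volume", "louder", "volume up"]
synonym_group = build_synonym_group(candidate_synonyms, excluded=["louder"])
pronunciations = {
    "raise volume": ["r", "ey", "z", "v", "aa", "l", "y", "uw", "m"],
    "volume up": ["v", "aa", "l", "y", "uw", "m", "ah", "p"],
}
phoneme_group = to_phoneme_sequences(synonym_group, pronunciations)
```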
- the synonymous phoneme sequence group acquisition unit 2320 may also be used to: receive speech input from a user; and generate a synonymous phoneme sequence group based on the speech input. For example, the semantics of the speech input is obtained based on the keywords corresponding to the speech input, so as to generate the synonymous phoneme sequence group.
- the target language information includes at least first language information and second language information, and the model determining unit 2330 may be further configured to: obtain a first set of decoding paths and a second set of decoding paths from the first decoding map, where the first set of decoding paths includes decoding paths for a first synonymous phoneme sequence group associated with the first language information and the second set of decoding paths includes decoding paths for a second synonymous phoneme sequence group associated with the second language information; generate a first subgraph based on the first set of decoding paths; generate a second subgraph based on the second set of decoding paths; and determine the speech recognition model based on at least the first subgraph and the second subgraph.
- the first subgraph indicates a first decoding path and a second decoding path, both of which are decoding paths in the first synonymous phoneme sequence group, and the two decoding paths have the same weight in the first subgraph. Based on this method, a faster decoding search over the expanded synonymous phoneme sequences can be implemented, thereby reducing computational overhead and storage overhead.
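The subgraph idea can be sketched in Python as follows. This is a simplified stand-in, not the decoding-graph structure of the disclosure: each subgraph is modeled as a plain mapping from a decoding path (a phoneme tuple) to a weight, and the choice of `1 / n` as the shared weight is an assumption. The text only requires that paths in one subgraph have the same weight.

```python
def build_subgraph(decoding_paths):
    """Give every decoding path of one synonymous group the same weight."""
    if not decoding_paths:
        raise ValueError("a subgraph needs at least one decoding path")
    shared_weight = 1.0 / len(decoding_paths)  # assumed normalization
    return {tuple(path): shared_weight for path in decoding_paths}


def build_model(paths_by_language_information):
    """Build one subgraph per piece of language information."""
    return {
        name: build_subgraph(paths)
        for name, paths in paths_by_language_information.items()
    }


# Illustrative data only: two groups of synonymous phoneme sequences.
model = build_model({
    "volume_up": [("t", "i", "g", "ao"), ("t", "iao", "g", "ao")],
    "power_off": [("g", "uan", "j", "i")],
})
```

Because all paths inside a subgraph share one weight, a search over the subgraph does not need per-path language-model scores, which is one plausible reading of why the text reports lower computational and storage overhead.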
- the target language information obtaining unit 2310 may further be configured to: obtain a first language information group according to pre-stored historical language information and received language information; and, in response to determining that the number of pieces of language information in the first language information group exceeds a predetermined threshold, acquire the target language information from the first language information group based on the predetermined threshold.
- the target language information obtaining unit 2310 may be further configured to: obtain target language information from the first language information group according to an attribute of the language information in the target language information, such that the quantity of the target language information equals the predetermined threshold. For example, one or more of the earliest-created pieces of historical language information may be deleted from the first language information group, thereby obtaining a predetermined threshold number of pieces of language information.
- the target language information obtaining unit 2310 may be further configured to: obtain target language information from the first language information group according to the user's instruction, and the quantity of the target language information is a predetermined threshold. For example, which language information in the first language information group to retain as the target language information may be selected according to user input.
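A minimal Python sketch of the threshold step, assuming the attribute used for selection is the creation time and that the oldest entries are dropped first (one of the examples given above). The `(created_at, text)` tuple layout and the function name are hypothetical.

```python
def cap_language_information(history, received, predetermined_threshold):
    """Merge stored and newly received entries; if the merged group exceeds
    the threshold, keep only the most recently created entries."""
    group = list(history) + list(received)  # entries are (created_at, text)
    if len(group) <= predetermined_threshold:
        return [text for _, text in group]
    newest_first = sorted(group, key=lambda entry: entry[0], reverse=True)
    kept = newest_first[:predetermined_threshold]
    kept.sort(key=lambda entry: entry[0])  # restore chronological order
    return [text for _, text in kept]


# Illustrative data only.
history = [(1, "pause"), (2, "stop")]
received = [(3, "turn off")]
target_language_information = cap_language_information(history, received, 2)
```

Selection by user instruction, the other option described above, would replace the sort with a filter over the entries the user chose to keep.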
- the first computing device may also instruct that the speech recognition model be provided to the target computing device (e.g., the second computing device) for deployment of the speech recognition model on the target computing device.
- Speech model building system 2300 may include a number of units for performing corresponding steps in process 1900 as discussed in FIG. 19 .
- the speech processing system 2400 includes: a speech instruction input unit 2410 for receiving a speech instruction input; and a speech processing unit 2420 for obtaining a phoneme sequence representation of the speech instruction input by using a speech recognition model, the speech recognition model being configured to recognize the speech instruction input based on a phoneme sequence group that is semantically synonymous with the instruction, and for executing an instruction corresponding to the phoneme sequence representation if the phoneme sequence representation corresponds to a phoneme sequence in the phoneme sequence group.
- the speech recognition model may be obtained by the first computing device.
- the first computing device may include, for example, a cloud-side device or an embedded heavy device, which may have strong computing capabilities for constructing the speech recognition model.
- the first computing device may also include, for example, a user terminal device.
- speech processing system 2400 may be executed by a second computing device, for example.
- the second computing device may comprise an embedded light device, e.g., one with less computing power, for performing speech processing with the deployed speech recognition model.
- Examples of the second computing device may include, but are not limited to, smart home devices (e.g., air conditioners, refrigerators, washing machines, TVs, speakers, etc.), smart wearable devices (e.g., bracelets, watches, glasses, etc.), or in-vehicle devices.
- the speech recognition model is obtained based on the following process: obtaining target language information; obtaining a synonymous phoneme sequence group associated with the target language information, the synonymous phoneme sequence group comprising at least one synonymous phoneme sequence, where a synonymous phoneme sequence is the phoneme sequence corresponding to words or sentences semantically similar to the target language information; training a language model using the synonymous phoneme sequence group to obtain a target language model; generating a first decoding map according to the target language model, the first decoding map indicating a plurality of decoding paths that satisfy the grammatical constraint rules determined based on the synonymous phoneme sequence group; and determining the speech recognition model based on the first decoding map.
- the speech processing unit 2420 may also be configured to provide a notification of no recognition result if the phoneme sequence representation does not match any phoneme sequence in the phoneme sequence group. Based on this method, the user's voice can be recognized in real time and efficiently, which improves the user's voice interaction experience.
- the speech recognition model is configured to identify a first phoneme sequence group whose phoneme sequences share a first semantics and a second phoneme sequence group whose phoneme sequences share a second semantics.
- the speech processing unit 2420 may also be configured to execute a first instruction if the phoneme sequence representation corresponds to a first phoneme sequence in the first phoneme sequence group, and to execute a second instruction different from the first instruction if the phoneme sequence representation corresponds to a second phoneme sequence in the second phoneme sequence group.
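The matching behavior described above (execute the instruction of the matching group, otherwise report that nothing was recognized) might look like this in Python. The instruction names, the dictionary layout, and the notification string are illustrative assumptions.

```python
def resolve_instruction(phoneme_sequence, groups):
    """Return the instruction whose synonymous group contains the sequence,
    or None when the sequence matches no group."""
    key = tuple(phoneme_sequence)
    for instruction, sequences in groups.items():
        if key in sequences:
            return instruction
    return None


def handle(phoneme_sequence, groups):
    """Execute-or-notify wrapper around resolve_instruction."""
    instruction = resolve_instruction(phoneme_sequence, groups)
    if instruction is None:
        return "no recognition result"
    return "executing " + instruction


# Illustrative data only: two groups with different semantics.
groups = {
    "volume_up": {("t", "i", "g", "ao"), ("t", "iao", "g", "ao")},
    "power_off": {("g", "uan", "j", "i")},
}
```

Calling `handle(["t", "iao", "g", "ao"], groups)` resolves to the first instruction, while a sequence outside both groups yields the no-result notification.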
- the speech processing unit may be further configured to: utilize the acoustic model to generate emission probabilities from speech features of the speech instruction input to phonemes; recognize the speech instruction input by inputting the emission probabilities to the speech recognition model; and cause the speech recognition model to output a phoneme sequence representation.
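To make the role of emission probabilities concrete, here is a small Python sketch that scores each allowed phoneme sequence against per-frame emission log-probabilities with a Viterbi-style monotonic alignment and returns the best one. This is a simplified substitute chosen for illustration, not the decoding-map search of the disclosure; the frame format (one dictionary of phoneme log-probabilities per frame) and both function names are assumptions.

```python
import math


def alignment_score(log_emissions, sequence):
    """Best log-probability of assigning frames, in order, to the phonemes of
    `sequence`, with every phoneme covering at least one frame."""
    n_frames, n_phonemes = len(log_emissions), len(sequence)
    if n_phonemes == 0 or n_frames < n_phonemes:
        return -math.inf
    best = [-math.inf] * n_phonemes
    best[0] = log_emissions[0].get(sequence[0], -math.inf)
    for t in range(1, n_frames):
        frame = log_emissions[t]
        updated = [-math.inf] * n_phonemes
        for j in range(n_phonemes):
            stay = best[j]                                   # same phoneme as previous frame
            advance = best[j - 1] if j > 0 else -math.inf    # move on to next phoneme
            previous = max(stay, advance)
            if previous > -math.inf:
                updated[j] = previous + frame.get(sequence[j], -math.inf)
        best = updated
    return best[-1]


def recognize(log_emissions, allowed_sequences):
    """Pick the allowed phoneme sequence with the highest alignment score."""
    best_sequence, best_score = None, -math.inf
    for sequence in allowed_sequences:
        score = alignment_score(log_emissions, sequence)
        if score > best_score:
            best_sequence, best_score = sequence, score
    return best_sequence


# Illustrative data only: three frames of emission log-probabilities.
frames = [
    {"k": math.log(0.9), "g": math.log(0.1)},
    {"k": math.log(0.6), "ai": math.log(0.4)},
    {"ai": math.log(0.8), "uan": math.log(0.2)},
]
result = recognize(frames, [("k", "ai"), ("g", "uan")])
```

`recognize` returns `None` when no allowed sequence can be aligned to the frames, which corresponds to the no-recognition case.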
- FIG. 25 shows a schematic block diagram of an example device 2500 that may be used to implement embodiments of the present disclosure.
- a first computing device (e.g., first computing device 130 in FIG. 1, first computing device 330 in FIG. 3, or first computing device 430 in FIG. 3) and/or a second computing device (e.g., second computing device 150 in FIG. 1, second computing device 310 in FIG. 3, or second computing device 450 in FIG. 3) according to embodiments of the present disclosure may be implemented by device 2500.
- device 2500 includes a central processing unit (CPU) 2501 that can perform various appropriate actions and processes according to computer program instructions stored in read only memory (ROM) 2502 or computer program instructions loaded from storage unit 2508 into random access memory (RAM) 2503.
- in the RAM 2503, various programs and data required for the operation of the device 2500 can also be stored.
- the CPU 2501, the ROM 2502, and the RAM 2503 are connected to each other through a bus 2504.
- Input/output (I/O) interface 2505 is also connected to bus 2504.
- Various components in the device 2500 are connected to the I/O interface 2505, including: an input unit 2506, such as a keyboard, mouse, etc.; an output unit 2507, such as various types of displays, speakers, etc.; a storage unit 2508, such as a magnetic disk, an optical disk, etc. ; and a communication unit 2509, such as a network card, modem, wireless communication transceiver, and the like.
- the communication unit 2509 allows the device 2500 to exchange information/data with other devices through a computer network such as the Internet and/or various telecommunication networks.
- processes 500, 600, 700, 800, 900, 1000, 1200, 1300, 1400, 1500, 1700, 1800, and 1900 may be performed by the processing unit 2501.
- the above-described processes may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as storage unit 2508.
- part or all of the computer program may be loaded and/or installed on device 2500 via ROM 2502 and/or communication unit 2509.
- When a computer program is loaded into RAM 2503 and executed by CPU 2501, one or more acts of the processes described above may be performed.
- the present disclosure may be a method, apparatus, system and/or computer program product.
- the computer program product may include a computer-readable storage medium having computer-readable program instructions loaded thereon for carrying out various aspects of the present disclosure.
- a computer-readable storage medium may be a tangible device that can hold and store instructions for use by the instruction execution device.
- the computer-readable storage medium may be, for example, but not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
- A non-exhaustive list of computer-readable storage media includes: portable computer disks, hard disks, random access memory (RAM), read only memory (ROM), erasable programmable read only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disk read only memory (CD-ROM), digital versatile disks (DVD), memory sticks, floppy disks, mechanically encoded devices such as punch cards or raised structures in grooves with instructions stored thereon, and any suitable combination of the above.
- Computer-readable storage media are not to be construed as transient signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., light pulses through fiber optic cables), or electrical signals transmitted through wires.
- the computer readable program instructions described herein may be downloaded to various computing/processing devices from a computer readable storage medium, or to an external computer or external storage device over a network such as the Internet, a local area network, a wide area network, and/or a wireless network.
- the network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers.
- a network adapter card or network interface in each computing/processing device receives computer-readable program instructions from a network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in each computing/processing device.
- Computer program instructions for carrying out operations of the present disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state setting data, or instructions in one or more programming languages.
- the computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server.
- the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (e.g., through the Internet using an Internet service provider).
- custom electronic circuits, such as programmable logic circuits, field programmable gate arrays (FPGAs), or programmable logic arrays (PLAs), can be personalized by utilizing state information of the computer readable program instructions, and such circuits can execute the computer readable program instructions to implement various aspects of the present disclosure.
- These computer readable program instructions may be provided to a processing unit of a general purpose computer, special purpose computer or other programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processing unit of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in one or more blocks of the flowchart and/or block diagrams.
- These computer readable program instructions can also be stored in a computer readable storage medium. These instructions cause a computer, programmable data processing apparatus and/or other equipment to operate in a specific manner, so that the computer readable medium storing the instructions includes an article of manufacture comprising instructions for implementing various aspects of the functions/acts specified in one or more blocks of the flowchart and/or block diagrams.
- Computer readable program instructions can also be loaded onto a computer, other programmable data processing apparatus, or other equipment to cause a series of operational steps to be performed on the computer, other programmable data processing apparatus, or other equipment to produce a computer-implemented process, thereby causing instructions executing on a computer, other programmable data processing apparatus, or other device to implement the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
- each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s).
- the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
- each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented in dedicated hardware-based systems that perform the specified functions or actions, or can be implemented in a combination of dedicated hardware and computer instructions.
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- Machine Translation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Claims (42)
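- A method for constructing a speech recognition model, comprising: obtaining a target keyword; obtaining a synonym group semantically associated with the target keyword; training a language model using the target keyword and the synonym group to obtain a target language model; generating a first decoding map according to the target language model, the first decoding map indicating a plurality of decoding paths that satisfy grammatical constraint rules determined based on the target keyword and the synonym group; and determining the speech recognition model based on the first decoding map.
- The method according to claim 1, wherein determining the synonym group semantically associated with the target keyword comprises: determining the semantics of the target keyword; and determining the synonym group based at least on the semantics of the target keyword, wherein the difference between the semantics of each synonym in the synonym group and the semantics of the target keyword is less than a difference threshold.
- The method according to claim 2, wherein determining the synonym group based at least on the semantics of the target keyword comprises: determining the synonym group based on the semantics of the target keyword and the length of the target keyword, wherein the difference between the length of each synonym in the synonym group and the length of the target keyword is less than a length threshold.
- The method according to claim 2, wherein determining the synonym group based at least on the semantics of the target keyword comprises: obtaining a plurality of candidate synonyms based on the semantics of the target keyword; providing the plurality of candidate synonyms to a user; and determining the synonym group from the plurality of candidate synonyms based on user input received from the user, the user input indicating that at least one candidate synonym of the plurality of candidate synonyms is excluded or confirmed.
- The method according to claim 1, wherein the target keyword comprises at least a first keyword and a second keyword, and wherein determining the speech recognition model based on the first decoding map comprises: obtaining a first group of decoding paths and a second group of decoding paths from the first decoding map, the first group of decoding paths comprising decoding paths corresponding to the first keyword and to a first synonym group semantically associated with the first keyword, and the second group of decoding paths comprising decoding paths corresponding to the second keyword and to a second synonym group semantically associated with the second keyword; generating a first subgraph based on the first group of decoding paths; generating a second subgraph based on the second group of decoding paths; and determining the speech recognition model based at least on the first subgraph and the second subgraph.
- The method according to claim 5, wherein the first subgraph indicates a first decoding path and second decoding paths, the first decoding path being a decoding path corresponding to the first keyword, the second decoding paths being decoding paths corresponding to synonyms in the first synonym group, and the first decoding path and each of the second decoding paths having the same weight in the first subgraph.
- The method according to claim 1, wherein obtaining the target keyword comprises: obtaining a first keyword group according to pre-stored historical keywords and received keywords; and in response to determining that the number of keywords in the first keyword group exceeds a predetermined threshold, obtaining the target keyword from the first keyword group based on the predetermined threshold.
- The method according to claim 7, wherein obtaining the target keyword from the first keyword group based on the predetermined threshold comprises: obtaining the target keyword from the first keyword group according to an attribute of the keywords in the target keyword, the number of target keywords being the predetermined threshold.
- The method according to claim 1, further comprising: instructing that the speech recognition model be provided to a target computing device for deployment of the speech recognition model on the target computing device.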
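- A speech processing method, comprising: receiving a speech input; and determining, using a speech recognition model, a text representation associated with the speech input, wherein the speech recognition model is obtained based on the following process: obtaining a target keyword; obtaining a synonym group semantically associated with the target keyword; training a language model using the target keyword and the synonym group to obtain a target language model; generating a first decoding map according to the target language model, the first decoding map indicating a plurality of decoding paths that satisfy grammatical constraint rules determined based on the target keyword and the synonym group; and determining the speech recognition model based on the first decoding map.
- The method according to claim 10, wherein determining the synonym group semantically associated with the target keyword comprises: determining the semantics of the target keyword; and determining the synonym group based at least on the semantics of the target keyword, wherein the difference between the semantics of each synonym in the synonym group and the semantics of the target keyword is less than a difference threshold.
- The method according to claim 11, wherein determining the synonym group based at least on the semantics of the target keyword comprises: determining the synonym group based on the semantics of the target keyword and the length of the target keyword, wherein the difference between the length of each synonym in the synonym group and the length of the target keyword is less than a length threshold.
- The method according to claim 11, wherein determining the synonym group based at least on the semantics of the target keyword comprises: obtaining a plurality of candidate synonyms based on the semantics of the target keyword; providing the plurality of candidate synonyms to a user; and determining the synonym group from the plurality of candidate synonyms based on user input received from the user, the user input indicating that at least one candidate synonym of the plurality of candidate synonyms is excluded or confirmed.
- The method according to claim 10, wherein the target keyword comprises at least a first keyword and a second keyword, and wherein determining the speech recognition model based on the first decoding map comprises: obtaining a first group of decoding paths and a second group of decoding paths from the first decoding map, the first group of decoding paths comprising decoding paths corresponding to the first keyword and to a first synonym group semantically associated with the first keyword, and the second group of decoding paths comprising decoding paths corresponding to the second keyword and to a second synonym group semantically associated with the second keyword; generating a first subgraph based on the first group of decoding paths; generating a second subgraph based on the second group of decoding paths; and determining the speech recognition model based at least on the first subgraph and the second subgraph.
- The method according to claim 14, wherein the first subgraph indicates a first decoding path and second decoding paths, the first decoding path being a decoding path corresponding to the first keyword, the second decoding paths being decoding paths corresponding to synonyms in the first synonym group, and the first decoding path and each of the second decoding paths having the same weight in the first subgraph.
- The method according to claim 10, wherein obtaining the target keyword comprises: obtaining a first keyword group according to pre-stored historical keywords and received keywords; and in response to determining that the number of keywords in the first keyword group exceeds a predetermined threshold, obtaining the target keyword from the first keyword group based on the predetermined threshold.
- The method according to claim 16, wherein obtaining the target keyword from the first keyword group based on the predetermined threshold comprises: obtaining the target keyword from the first keyword group according to an attribute of the keywords in the target keyword, the number of target keywords being the predetermined threshold.
- The method according to claim 10, further comprising: performing an action corresponding to the text representation.
- The method according to claim 10, wherein the text representation corresponds to the target keyword or to a synonym in the synonym group.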
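- A speech model construction system, comprising: a keyword obtaining unit for obtaining a target keyword; a synonym obtaining unit for obtaining a synonym group semantically associated with the target keyword; a model training unit for training a language model using the target keyword and the synonym group to obtain a target language model; a decoding map generating unit for generating a first decoding map according to the target language model, the first decoding map indicating a plurality of decoding paths that satisfy grammatical constraint rules determined based on the target keyword and the synonym group; and a model determining unit for determining the speech recognition model based on the first decoding map.
- A speech processing system, comprising: a speech input unit for receiving a speech input; and a speech processing unit for determining, using a speech recognition model, a text representation associated with the speech input, wherein the speech recognition model is obtained based on the following process: obtaining a target keyword; obtaining a synonym group semantically associated with the target keyword; training a language model using the target keyword and the synonym group to obtain a target language model; generating a first decoding map according to the target language model, the first decoding map indicating a plurality of decoding paths that satisfy grammatical constraint rules determined based on the target keyword and the synonym group; and determining the speech recognition model based on the first decoding map.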
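- A method for constructing a speech recognition model, comprising: obtaining target language information; obtaining a synonymous phoneme sequence group associated with the target language information, the synonymous phoneme sequence group comprising at least one synonymous phoneme sequence, the at least one synonymous phoneme sequence being a phoneme sequence corresponding to words or sentences semantically similar to the target language information; training a language model using the synonymous phoneme sequence group to obtain a target language model; generating a first decoding map according to the target language model, the first decoding map indicating a plurality of decoding paths that satisfy grammatical constraint rules determined based on the synonymous phoneme sequence group; and determining the speech recognition model based on the first decoding map.
- The method according to claim 22, wherein obtaining the synonymous phoneme sequence group associated with the target language information comprises: determining the semantics of the target language information; and determining the synonymous phoneme sequence group based at least on the semantics of the target language information, wherein the difference between the semantics of each synonymous phoneme sequence in the synonymous phoneme sequence group and the semantics of the target language information is less than a difference threshold.
- The method according to claim 23, wherein determining the synonymous phoneme sequence group based at least on the semantics of the target language information comprises: determining a target phoneme sequence corresponding to the target language information; and determining the synonymous phoneme sequence group based on the semantics of the target language information and the length of the target phoneme sequence, wherein the difference between the length of each synonymous phoneme sequence in the synonymous phoneme sequence group and the length of the target phoneme sequence is less than a length threshold.
- The method according to claim 24, wherein obtaining the synonymous phoneme sequence group associated with the target language information comprises: obtaining a plurality of candidate synonyms based on the semantics of a target keyword corresponding to the target language information; providing the plurality of candidate synonyms to a user; determining a synonym group from the plurality of candidate synonyms based on user input received from the user, the user input indicating that at least one candidate synonym of the plurality of candidate synonyms is excluded or confirmed; and obtaining the synonymous phoneme sequence group based on a pronunciation dictionary and the synonym group.
- The method according to claim 22, wherein obtaining the synonymous phoneme sequence group semantically associated with the target phoneme sequence comprises: receiving a speech input from a user; and generating the synonymous phoneme sequence group based on the speech input.
- The method according to claim 22, wherein the target language information comprises at least first language information and second language information, and wherein determining the speech recognition model based on the first decoding map comprises: obtaining a first group of decoding paths and a second group of decoding paths from the first decoding map, the first group of decoding paths comprising decoding paths of a first synonymous phoneme sequence group associated with the first language information, and the second group of decoding paths comprising decoding paths of a second synonymous phoneme sequence group associated with the second language information; generating a first subgraph based on the first group of decoding paths; generating a second subgraph based on the second group of decoding paths; and determining the speech recognition model based at least on the first subgraph and the second subgraph.
- The method according to claim 27, wherein the first subgraph indicates a first decoding path and a second decoding path, the first decoding path and the second decoding path being decoding paths in the first group of decoding paths, and the first decoding path and the second decoding path having the same weight in the first subgraph.
- The method according to claim 22, wherein obtaining the target language information comprises: obtaining a first language information group according to pre-stored historical language information and received language information; and in response to determining that the number of pieces of language information in the first language information group exceeds a predetermined threshold, obtaining the target language information from the first language information group based on the predetermined threshold.
- The method according to claim 29, wherein obtaining the target language information from the first language information group based on the predetermined threshold comprises: obtaining the target language information from the first phoneme sequence group according to an attribute of the language information in the first language information group, the quantity of the target language information being the predetermined threshold.
- The method according to claim 22, wherein the target language information comprises speech or text.
- The method according to claim 22, further comprising: instructing that the speech recognition model be provided to a target computing device for deployment of the speech recognition model on the target computing device.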
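- A speech processing method, comprising: receiving a speech instruction input; obtaining a phoneme sequence representation of the speech instruction input using a speech recognition model, the speech recognition model being configured to recognize the speech instruction input based on a phoneme sequence group that is semantically synonymous with an instruction; and when the phoneme sequence representation corresponds to a phoneme sequence in the synonymous phoneme sequence group, executing an instruction corresponding to the phoneme sequence representation.
- The method according to claim 33, wherein the speech recognition model is obtained based on the following process: obtaining target language information; obtaining a synonymous phoneme sequence group associated with the target language information, the synonymous phoneme sequence group comprising at least one synonymous phoneme sequence, the at least one synonymous phoneme sequence being a phoneme sequence corresponding to words or sentences semantically similar to the target language information; training a language model using the synonymous phoneme sequence group to obtain a target language model; generating a first decoding map according to the target language model, the first decoding map indicating a plurality of decoding paths that satisfy grammatical constraint rules determined based on the synonymous phoneme sequence group; and determining the speech recognition model based on the first decoding map.
- The method according to claim 33, further comprising: providing a notification of no recognition result if the phoneme sequence representation does not match any phoneme sequence in the phoneme sequence group.
- The method according to claim 33, wherein the speech recognition model is configured to recognize a first phoneme sequence group whose phoneme sequences share a first semantics and a second phoneme sequence group whose phoneme sequences share a second semantics, the method further comprising: executing a first instruction if the phoneme sequence representation corresponds to a first phoneme sequence in the first phoneme sequence group; and executing a second instruction different from the first instruction if the phoneme sequence representation corresponds to a second phoneme sequence in the second phoneme sequence group.
- The method according to claim 33, wherein obtaining the phoneme sequence representation of the speech instruction input using the speech recognition model comprises: generating, using an acoustic model, emission probabilities from speech features of the speech instruction input to phonemes; recognizing the speech instruction input by inputting the emission probabilities to the speech recognition model; and causing the speech recognition model to output the phoneme sequence representation.
- A speech model construction system, comprising: a target language information obtaining unit for obtaining target language information; a synonymous phoneme sequence group obtaining unit for obtaining a synonymous phoneme sequence group associated with the target language information, the synonymous phoneme sequence group comprising at least one synonymous phoneme sequence, the at least one synonymous phoneme sequence being a phoneme sequence corresponding to words or sentences semantically similar to the target language information; a model training unit for training a language model using the synonymous phoneme sequence group to obtain a target language model; a decoding map generating unit for generating a first decoding map according to the target language model, the first decoding map indicating a plurality of decoding paths that satisfy grammatical constraint rules determined based on the synonymous phoneme sequence group; and a model determining unit for determining the speech recognition model based on the first decoding map.
- A speech processing system, comprising: a speech instruction input unit for receiving a speech instruction input; and a speech processing unit for obtaining a phoneme sequence representation of the speech instruction input using a speech recognition model, the speech recognition model being configured to recognize the speech instruction input based on a phoneme sequence group that is semantically synonymous with an instruction, and for executing an instruction corresponding to the phoneme sequence representation if the phoneme sequence representation corresponds to a phoneme sequence in the phoneme sequence group.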
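- An electronic device, comprising: at least one computing unit; and at least one memory coupled to the at least one computing unit and storing instructions for execution by the at least one computing unit, the instructions, when executed by the at least one computing unit, causing the device to perform the method according to any one of claims 1 to 19 or claims 22 to 37.
- A computer-readable storage medium having a computer program stored thereon, the program, when executed by a processor, implementing the method according to any one of claims 1 to 19 or claims 22 to 37.
- A computer program product comprising computer-executable instructions, wherein the computer-executable instructions, when executed by a processor, implement the method according to any one of claims 1 to 19 or claims 22 to 37.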
Priority Applications (3)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US18/557,213 US20240242709A1 (en) | 2021-04-27 | 2022-03-25 | Method and System for Constructing Speech Recognition Model and Speech Processing |
| CN202280004497.7A CN115668360A (zh) | 2021-04-27 | 2022-03-25 | 构建语音识别模型和语音处理的方法和系统 |
| EP22794446.9A EP4310837B1 (en) | 2021-04-27 | 2022-03-25 | Methods for constructing speech recognition model and processing speech, and system |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/CN2021/090353 WO2022226811A1 (zh) | 2021-04-27 | 2021-04-27 | 构建语音识别模型和语音处理的方法和系统 |
| CNPCT/CN2021/090353 | 2021-04-27 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2022227973A1 (zh) | 2022-11-03 |
Family
ID=83846640
Family Applications (2)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2021/090353 Ceased WO2022226811A1 (zh) | 2021-04-27 | 2021-04-27 | 构建语音识别模型和语音处理的方法和系统 |
| PCT/CN2022/083190 Ceased WO2022227973A1 (zh) | 2021-04-27 | 2022-03-25 | 构建语音识别模型和语音处理的方法和系统 |
Family Applications Before (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2021/090353 Ceased WO2022226811A1 (zh) | 2021-04-27 | 2021-04-27 | 构建语音识别模型和语音处理的方法和系统 |
Country Status (4)
| Country | Link |
|---|---|
| US (1) | US20240242709A1 (zh) |
| EP (1) | EP4310837B1 (zh) |
| CN (1) | CN115668360A (zh) |
| WO (2) | WO2022226811A1 (zh) |
Families Citing this family (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2022226811A1 (zh) * | 2021-04-27 | 2022-11-03 | 华为技术有限公司 | 构建语音识别模型和语音处理的方法和系统 |
| US20240202234A1 (en) * | 2021-06-23 | 2024-06-20 | Sri International | Keyword variation for querying foreign language audio recordings |
| CN117725161A (zh) * | 2023-12-21 | 2024-03-19 | 伟金投资有限公司 | 文本中变种词的识别及提取敏感词的方法和系统 |
| CN118915544B (zh) * | 2024-07-25 | 2025-03-28 | 江苏梦想物联有限公司 | 一种基于大数据的无人机飞行智能语音控制系统及方法 |
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN105764185A (zh) * | 2016-03-18 | 2016-07-13 | 深圳Tcl数字技术有限公司 | 交流驱动混合调光电路和电视机 |
| CN107066497A (zh) * | 2016-12-29 | 2017-08-18 | 努比亚技术有限公司 | 一种搜索方法和装置 |
| CN108140019A (zh) * | 2015-10-09 | 2018-06-08 | 三菱电机株式会社 | 语言模型生成装置、语言模型生成方法及其程序、语音识别装置以及语音识别方法及其程序 |
| US20200035230A1 (en) * | 2018-07-27 | 2020-01-30 | Samsung Electronics Co., Ltd. | System and method supporting context-specific language model |
| CN111933129A (zh) * | 2020-09-11 | 2020-11-13 | 腾讯科技(深圳)有限公司 | 音频处理方法、语言模型的训练方法、装置及计算机设备 |
Family Cites Families (9)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US8682661B1 (en) * | 2010-08-31 | 2014-03-25 | Google Inc. | Robust speech recognition |
| CN103325370B (zh) * | 2013-07-01 | 2015-11-25 | 百度在线网络技术(北京)有限公司 | 语音识别方法和语音识别系统 |
| US10943583B1 (en) * | 2017-07-20 | 2021-03-09 | Amazon Technologies, Inc. | Creation of language models for speech recognition |
| CN109243428B (zh) * | 2018-10-15 | 2019-11-26 | 百度在线网络技术(北京)有限公司 | 一种建立语音识别模型的方法、语音识别方法及系统 |
| CN109377985B (zh) * | 2018-11-27 | 2022-03-18 | 北京分音塔科技有限公司 | 一种领域词的语音识别增强方法和装置 |
| CN111428476B (zh) * | 2019-01-09 | 2023-03-31 | 百度在线网络技术(北京)有限公司 | 同义词生成方法、装置、电子设备及存储介质 |
| CN110704571B (zh) * | 2019-08-16 | 2022-02-15 | 平安科技(深圳)有限公司 | 庭审辅助处理方法、审判辅助处理方法、装置、设备及介质 |
| CN110688837B (zh) * | 2019-09-27 | 2023-10-31 | 北京百度网讯科技有限公司 | 数据处理的方法及装置 |
| WO2022226811A1 (zh) * | 2021-04-27 | 2022-11-03 | 华为技术有限公司 | 构建语音识别模型和语音处理的方法和系统 |
2021
- 2021-04-27 WO PCT/CN2021/090353 patent/WO2022226811A1/zh not_active Ceased
2022
- 2022-03-25 CN CN202280004497.7A patent/CN115668360A/zh active Pending
- 2022-03-25 US US18/557,213 patent/US20240242709A1/en active Pending
- 2022-03-25 WO PCT/CN2022/083190 patent/WO2022227973A1/zh not_active Ceased
- 2022-03-25 EP EP22794446.9A patent/EP4310837B1/en active Active
Patent Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN108140019A (zh) * | 2015-10-09 | 2018-06-08 | 三菱电机株式会社 | 语言模型生成装置、语言模型生成方法及其程序、语音识别装置以及语音识别方法及其程序 |
| CN105764185A (zh) * | 2016-03-18 | 2016-07-13 | 深圳Tcl数字技术有限公司 | 交流驱动混合调光电路和电视机 |
| CN107066497A (zh) * | 2016-12-29 | 2017-08-18 | 努比亚技术有限公司 | 一种搜索方法和装置 |
| US20200035230A1 (en) * | 2018-07-27 | 2020-01-30 | Samsung Electronics Co., Ltd. | System and method supporting context-specific language model |
| CN111933129A (zh) * | 2020-09-11 | 2020-11-13 | 腾讯科技(深圳)有限公司 | 音频处理方法、语言模型的训练方法、装置及计算机设备 |
Non-Patent Citations (1)
| Title |
|---|
| See also references of EP4310837A4 * |
Also Published As
| Publication number | Publication date |
|---|---|
| EP4310837A1 (en) | 2024-01-24 |
| EP4310837A4 (en) | 2024-07-24 |
| CN115668360A (zh) | 2023-01-31 |
| WO2022226811A1 (zh) | 2022-11-03 |
| US20240242709A1 (en) | 2024-07-18 |
| EP4310837B1 (en) | 2026-03-04 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| KR102429436B1 (ko) | 사용자의 입력 입력에 기초하여 타겟 디바이스를 결정하고, 타겟 디바이스를 제어하는 서버 및 그 동작 방법 | |
| US11887604B1 (en) | Speech interface device with caching component | |
| AU2021202694B2 (en) | Facilitating end-to-end communications with automated assistants in multiple languages | |
| CN111880645B (zh) | 基于用户的语音输入确定目标设备并控制目标设备的服务器及其操作方法 | |
| WO2022227973A1 (zh) | 构建语音识别模型和语音处理的方法和系统 | |
| US20240338860A1 (en) | Text and image generation for creation of imagery from audible input | |
| CN107134279B (zh) | 一种语音唤醒方法、装置、终端和存储介质 | |
| US11373645B1 (en) | Updating personalized data on a speech interface device | |
| CN111754978B (zh) | 韵律层级标注方法、装置、设备和存储介质 | |
| CN112466302B (zh) | 语音交互的方法、装置、电子设备和存储介质 | |
| KR102386854B1 (ko) | 통합 모델 기반의 음성 인식 장치 및 방법 | |
| EP3193328B1 (en) | Method and device for performing voice recognition using grammar model | |
| US20210151039A1 (en) | Method and apparatus for speech interaction, and computer storage medium | |
| CN105531758B (zh) | 使用外国单词语法的语音识别 | |
| CN111883121A (zh) | 唤醒方法、装置及电子设备 | |
| EP3790002B1 (en) | System and method for modifying speech recognition result | |
| CN106575293A (zh) | 孤立话语检测系统和方法 | |
| CN113611316A (zh) | 人机交互方法、装置、设备以及存储介质 | |
| WO2018153273A1 (zh) | 语义解析方法、装置及存储介质 | |
| KR20200084260A (ko) | 전자 장치 및 이의 제어 방법 | |
| CN111353035B (zh) | 人机对话方法、装置、可读存储介质及电子设备 | |
| CN112528605A (zh) | 文本风格处理方法、装置、电子设备和存储介质 | |
| CN111667815B (zh) | 用于文本到语音转换的方法、设备、芯片电路和介质 | |
| JP7204861B2 (ja) | 中国語と英語の混在音声の認識方法、装置、電子機器及び記憶媒体 | |
| CN104062910A (zh) | 命令生成装置、设备的智能控制方法和系统 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 22794446; Country of ref document: EP; Kind code of ref document: A1 |
| | WWE | WIPO information: entry into national phase | Ref document number: 2022794446; Country of ref document: EP |
| | ENP | Entry into the national phase | Ref document number: 2022794446; Country of ref document: EP; Effective date: 20231018 |
| | WWE | WIPO information: entry into national phase | Ref document number: 18557213; Country of ref document: US |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | WWG | WIPO information: grant in national office | Ref document number: 2022794446; Country of ref document: EP |