EP4599429A1 - Dynamic adaptation of speech synthesis by an automated assistant during one or more automated telephone calls - Google Patents
- Publication number
- EP4599429A1 (application EP24837819.2A)
- Authority
- EP
- European Patent Office
- Prior art keywords
- entity
- automated
- telephone call
- during
- automated assistant
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04M—TELEPHONIC COMMUNICATION
- H04M3/00—Automatic or semi-automatic exchanges
- H04M3/42—Systems providing special services or facilities to subscribers
- H04M3/487—Arrangements for providing information services, e.g. recorded voice services or time announcements
- H04M3/493—Interactive information services, e.g. directory enquiries ; Arrangements therefor, e.g. interactive voice response [IVR] systems or voice portals
- H04M3/4936—Speech interaction details
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04M—TELEPHONIC COMMUNICATION
- H04M3/00—Automatic or semi-automatic exchanges
- H04M3/42—Systems providing special services or facilities to subscribers
- H04M3/50—Centralised arrangements for answering calls; Centralised arrangements for recording messages for absent or busy subscribers ; Centralised arrangements for recording messages
- H04M3/527—Centralised call answering arrangements not requiring operator intervention
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/04—Details of speech synthesis systems, e.g. synthesiser structure or memory management
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04M—TELEPHONIC COMMUNICATION
- H04M2201/00—Electronic components, circuits, software, systems or apparatus used in telephone systems
- H04M2201/39—Electronic components, circuits, software, systems or apparatus used in telephone systems using speech synthesis
Definitions
- Humans may engage in human-to-computer dialogs with interactive software applications referred to as “chatbots,” “automated assistants”, “intelligent personal assistants,” etc. (referred to herein as “automated assistants”).
- these automated assistants may correspond to a machine learning model or a combination of different machine learning models, and may be utilized to perform various tasks on behalf of users. For instance, some of these automated assistants can initiate telephone calls and conduct conversations with various human users or other automated assistants during the telephone calls to perform task(s) on behalf of the users (referred to herein as “automated telephone calls”).
- these automated assistants can cause corresponding instances of synthesized speech to be rendered at a corresponding client device of the various human users, and receive instances of corresponding responses from the various human users. Based on the instances of the synthesized speech and/or the instances of the corresponding responses, these automated assistants can determine a result of performance of the task(s), and cause an indication of the result of the performance of the task(s) to be provided for presentation to the users.
- these automated assistants typically utilize a single voice.
- these automated assistants typically utilize the same text-to-speech (TTS) model and/or the same set of prosodic properties (e.g., intonation, tone, stress, rhythm, etc.) in generating the corresponding instances of synthesized speech throughout a duration of the automated telephone calls.
- the single voice utilized by these automated assistants is typically robotic and can be off-putting to the various human users that interact with these automated assistants during the automated telephone call. Accordingly, the likelihood of successful completion of the task(s) may be reduced, thereby resulting in wasted computational and/or network resources when performance of the task(s) by these automated assistants fails.
- Implementations described herein are directed to dynamic adaptation of speech synthesis by an automated assistant during automated telephone call(s).
- processor(s) of a system can select an initial voice to be utilized by the automated assistant in generating synthesized speech audio data during an automated telephone call. However, during the automated telephone call, the processor(s) can determine to select an alternative voice to be utilized by the automated assistant in generating synthesized speech audio data and in continuing the automated telephone call. In additional or alternative implementations, and during the automated telephone call, the processor(s) can determine whether to generate any synthesized speech audio data that includes a unique personal identifier on a character-by-character basis or on a non-character-by-character basis. In additional or alternative implementations, and during the automated telephone call, the processor(s) can determine whether to inject pause(s) into any synthesized speech audio data that is generated.
- the processor(s) can select the initial voice based on one or more criteria, such as a type of entity to be engaged with during the automated telephone call, a particular location associated with the entity to be engaged with during the automated telephone call, whether a phone number associated with the entity to be engaged with during the automated telephone call is a landline or non-landline, and/or other criteria.
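As a rough illustration of the selection criteria above, the following sketch maps an entity type, a location, and a landline flag to an initial voice. The `CallContext` type, the voice names, the priority order, and the rule that a landline number selects a slower variant are all assumptions made for illustration; the source does not specify how the criteria are combined.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CallContext:
    """Hypothetical container for the selection criteria."""
    entity_type: str        # e.g. "restaurant"
    region: Optional[str]   # e.g. "southeastern_us"; None if unknown
    is_landline: bool


# Hypothetical voice catalogue; the names are placeholders, not real voices.
REGIONAL_VOICES = {
    "southeastern_us": "voice_southeastern_us",
    "northeastern_us": "voice_northeastern_us",
}
ENTITY_TYPE_VOICES = {"government_office": "voice_formal"}
DEFAULT_VOICE = "voice_neutral"


def select_initial_voice(ctx: CallContext) -> str:
    # Assumed priority: entity-type override, then region, then default.
    voice = ENTITY_TYPE_VOICES.get(ctx.entity_type)
    if voice is None:
        voice = REGIONAL_VOICES.get(ctx.region or "", DEFAULT_VOICE)
    # Assumption: landline numbers get a slower variant of the voice.
    return voice + "_slow" if ctx.is_landline else voice
```

A production system would more plausibly learn this mapping from call outcomes than hard-code it; the table form is used here only to keep the criteria visible.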
- the various voices described herein can be associated with different sets of prosodic properties that influence, for example, intonation, tone, stress, rhythm, and/or other properties of speech and how the speech is perceived.
- the processor(s) can initiate the automated telephone call with the entity. However, despite selecting the initial voice for utilization during the automated telephone call, the processor(s) can continuously monitor for one or more signals to determine whether to modify the initial voice that was selected. For example, the processor(s) can analyze content of a conversation that includes audio data capturing an interaction between the automated assistant and a representative of the entity, a transcript of the interaction between the automated assistant and a representative of the entity, and/or other content of the conversation.
- the representative of the entity can be, for example, a human representative, a voice bot, an interactive voice response (IVR) system, etc.
- the processor(s) can dynamically adapt the initial voice utilized in generating the one or more corresponding instances of synthesized speech to the alternative voice that is predicted to maximize success of the automated assistant performing a task during the automated telephone call.
- the initial voice that is selected may correspond to an accent or utilize vocabulary that is specific to a geographical region in which the entity is situated. However, upon determining that the representative associated with the entity does not reflect the accent or utilize vocabulary that is specific to the geographical region, the processor(s) can cause the automated assistant to switch to the alternative voice that better reflects that of the representative associated with the entity. Additionally, or alternatively, the initial voice that is selected may result in a first intonation or a first cadence of the synthesized speech being utilized. However, upon determining that the representative associated with the entity has a second intonation or a second cadence, the processor(s) can cause the automated assistant to switch to the alternative voice that better reflects the second intonation or the second cadence.
- the voice utilized by the automated assistant can be dynamically adapted throughout a duration of the automated telephone call to maximize success of the automated assistant performing the task during the automated telephone call.
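One way to picture the switch to a voice that better reflects the representative's intonation or cadence is a nearest-match search over candidate voices. In the sketch below, pitch in hertz stands in for intonation and words per minute stands in for cadence; these features, the scaling constants, the `margin` parameter, and all names are assumptions for illustration and do not come from the source.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Voice:
    name: str
    pitch_hz: float        # assumed stand-in for intonation
    words_per_min: float   # assumed stand-in for cadence


def distance(voice: Voice, pitch_hz: float, words_per_min: float) -> float:
    # Scale each feature so that neither dominates (constants are arbitrary).
    return math.hypot((voice.pitch_hz - pitch_hz) / 50.0,
                      (voice.words_per_min - words_per_min) / 40.0)


def maybe_switch_voice(current: Voice, candidates: Sequence[Voice],
                       observed_pitch_hz: float, observed_wpm: float,
                       margin: float = 0.25) -> Voice:
    """Return the closest candidate only if it beats the current voice by `margin`."""
    best = min(candidates, key=lambda v: distance(v, observed_pitch_hz, observed_wpm))
    best_d = distance(best, observed_pitch_hz, observed_wpm)
    current_d = distance(current, observed_pitch_hz, observed_wpm)
    return best if best_d + margin < current_d else current
```

The margin keeps the assistant from flipping between voices on small fluctuations in what it hears from the representative.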
- While the voice that maximizes success of the automated assistant performing the task during the automated telephone call may be subjective, causing the automated assistant performing the task to reflect a voice of the representative associated with the entity makes the voice sound objectively better to that representative.
- the processor(s) can make this determination based on whether the representative associated with the entity is a human representative or a non-human representative (e.g., a voice bot associated with the entity, an interactive voice response (IVR) system associated with the entity, etc.).
- If the unique personal identifier is frequent in a lexicon of users, then the processor(s) may determine to generate the synthesized speech audio data that includes the unique personal identifier on the non-character-by-character basis and without any pause(s), but if the unique personal identifier is not frequent in a lexicon of users, then the processor(s) may determine to generate the synthesized speech audio data that includes the unique personal identifier on the character-by-character basis and/or with pause(s).
- If a combination of letters and/or numbers of the unique personal identifier is relatively simple, then the processor(s) may determine to generate the synthesized speech audio data that includes the unique personal identifier on the non-character-by-character basis and without any pause(s), but if a combination of letters and/or numbers of the unique personal identifier is relatively complex, then the processor(s) may determine to generate the synthesized speech audio data that includes the unique personal identifier on the character-by-character basis and/or with pause(s).
- When the representative associated with the entity is a human representative, the processor(s) may be more likely to render the unique personal identifier on the character-by-character basis and with pause(s).
- When the representative associated with the entity is a non-human representative, the processor(s) may be more likely to render the unique personal identifier on the non-character-by-character basis and without pause(s) (e.g., since the IVR system and/or the voice bot representative are likely to employ ASR model(s) to interpret any synthesized speech rendered by the automated assistant during the automated telephone call).
- FIG. 1 depicts a block diagram of an example environment that demonstrates various aspects of the present disclosure, and in which implementations disclosed herein can be implemented.
- FIG. 2 depicts an example process using various components from the example environment from FIG. 1, in accordance with various implementations.
- FIG. 3 depicts a flowchart illustrating an example method of switching voices utilized by an automated assistant during an automated telephone call, in accordance with various implementations.
- FIG. 4 depicts a flowchart illustrating an example method of dynamically adapting how unique personal identifiers are rendered during an automated telephone call, in accordance with various implementations.
- FIG. 5A and FIG. 5B depict various non-limiting examples of switching voices utilized by an automated assistant during an automated telephone call, in accordance with various implementations.
- FIG. 6A, FIG. 6B, FIG. 6C, and FIG. 6D depict various non-limiting examples of dynamically adapting how unique personal identifiers are rendered during an automated telephone call, in accordance with various implementations.
- FIG. 7 depicts an example architecture of a computing device, in accordance with various implementations.
- the user input engine 111 can detect various types of user input at the client device 110.
- the user input detected at the client device 110 can include spoken utterance(s) of a human user of the client device 110 that is detected via microphone(s) of the client device 110.
- the microphone(s) of the client device 110 can generate audio data that captures the spoken utterance(s).
- the user input detected at the client device 110 can include touch input of a human user of the client device 110 that is detected via user interface input device(s) (e.g., touch sensitive display(s)) of the client device 110, and/or typed input detected via user interface input device(s) (e.g., touch sensitive display(s) and/or keyboard(s)) of the client device 110.
- user interface input device(s) of the client device 110 can generate textual data that captures the touch input and/or the typed input.
- the rendering engine 112 can cause content and/or other output to be visually rendered for presentation to the user at the client device 110 (e.g., via a touch sensitive display or other user interface output device(s)) and/or audibly rendered for presentation to the user at the client device 110 (e.g., via speaker(s) or other user interface output device(s)).
- the content and/or other output can include, for example, a transcript of a dialog between a user of the client device 110 and an automated assistant 115 executing at least in part at the client device 110, a transcript of a dialog between the automated assistant 115 executing at least in part at the client device 110 and an additional user that is in addition to the user of the client device 110, notifications, selectable graphical elements, and/or any other content and/or output described herein.
- the client device 110 is illustrated in FIG.
- the automated telephone call system 120 can be, for example, a high-performance server, a cluster of high-performance servers, and/or any other computing device that is remote from the client device 110.
- the automated telephone call system 120 includes, in various implementations, a machine learning (ML) model engine 130, a task identification engine 140, an entity identification engine 150, a voice engine 160, and a conversation engine 170.
- the ML model engine 130 can include various sub-engines, such as an automatic speech recognition (ASR) engine 131, a natural language understanding (NLU) engine 132, a fulfillment engine 133, a text-to-speech (TTS) engine 134, and a large language model (LLM) engine 135. These various sub-engines can utilize one or more respective ML models (e.g., stored in ML models database 130A).
- the voice engine 160 can include various sub-engines, such as voice selection engine 161, a voice modification engine 162, a unique personal identifier engine 163, and a pause engine 164.
- the automated telephone call system 120 can leverage various databases.
- the ML model engine 130 can leverage the ML models database 130A that stores various ML models and optionally the prosodic properties database 130B that stores various sets of prosodic properties;
- the task identification engine 140 can leverage tasks database 140A that stores various tasks, parameters associated with the various tasks, entities that can be interacted with to perform the various tasks;
- the entity identification engine 150 can leverage entities database 150A that stores various entities;
- the unique personal identifier engine 163 can leverage unique personal identifiers database 163A that stores various unique personal identifiers and information associated therewith;
- the conversation engine 170 can leverage conversations database 170A that stores various conversations between users, users and automated assistants, between automated assistants, and/or other conversations.
- Although FIG. 1 is depicted with respect to certain engines and/or sub-engines of the automated telephone call system 120 having access to certain databases, it should be understood that this is for the sake of example and is not meant to be limiting.
- the client device 110 can execute the automated telephone call system client 113.
- An instance of the automated telephone call system client 113 can be an application that is separate from an operating system of the client device 110 (e.g., installed "on top" of the operating system), or can alternatively be implemented directly by the operating system of the client device 110.
- the automated telephone call system client 113 can implement the automated telephone call system 120 locally at the client device 110 and/or remotely from the client device 110 via one or more of the networks 199 (e.g., as shown in FIG. 1).
- the automated telephone call system client 113 (and optionally by way of its interactions with the automated telephone call system 120) may form what appears to be, from a user's perspective, a logical instance of the automated assistant 115 with which the user may engage in a human-to-computer dialog.
- An instance of the automated assistant 115 is depicted in FIG. 1, and is encompassed by a dashed line that includes the automated telephone call system client 113 of the client device 110 and the automated telephone call system 120.
- the client device 110 and/or the automated telephone call system 120 may include one or more memories for storage of data and software applications, one or more processors for accessing data and executing the software applications, and other components that facilitate communication over one or more of the networks 199.
- one or more of the software applications can be installed locally at the client device 110, whereas in other implementations one or more of the software applications can be hosted remotely from the client device 110 (e.g., by one or more servers), but accessible by the client device 110 over one or more of the networks 199.
- the automated telephone call system 120 can be utilized to dynamically adapt speech synthesis by the automated assistant 115 during automated telephone calls in an effort to conserve computational resources and/or network resources.
- the resulting synthesized speech will better resonate with a user that is consuming the synthesized speech. While what resonates with the user that is consuming the synthesized speech will depend on the subjective preferences and goals of the user, the resulting synthesized speech will be made objectively more relevant to the user's subjective preferences.
- the resulting synthesized speech can better reflect a voice which will resonate with the user, thereby increasing the likelihood that the automated assistant will successfully complete the task and eliminating and/or mitigating instances in which the automated assistant does not successfully complete the task due to the user being frustrated with the voice of the automated assistant or the like.
- synthesized speech that includes unique personal identifiers on a character-by-character basis (e.g., synthesized speech of "J", "o", "h", "n" or "John with an h" for a unique personal identifier of "John") or a non-character-by-character basis (e.g., synthesized speech of "John" for the unique personal identifier of "John" where it could be unclear whether an "h" is included based on audible rendering of "John") based on various factors described herein (e.g., with respect to FIGS.
- the automated assistant can guide the human-to-computer interaction to a conclusion in a more quick and efficient manner, thereby conserving computational and/or network resources by eliminating and/or mitigating instances in which the automated assistant is asked to repeat one or more portions of these unique personal identifiers (e.g., "is that John with or without the 'h'” if "John” was audibly rendered instead of "J", "o", "h", “n” or "John with an h”).
- the automated assistant can guide the human-to-computer interaction to a conclusion in a more quick and efficient manner, thereby conserving computational and/or network resources by eliminating and/or mitigating instances in which the automated assistant is asked to repeat one or more portions of these unique personal identifiers or slow down (e.g., while the user performs one or more actions based on the unique personal identifiers).
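To make the character-by-character rendering with optional pauses concrete, the sketch below turns an identifier such as "John" into text that a TTS engine could speak one character at a time. It assumes the TTS engine accepts SSML and honors the standard `<break>` element; the function name `spell_out`, the bare `<speak>` wrapper, and the pause length are hypothetical choices, not details from the source.

```python
from xml.sax.saxutils import escape


def spell_out(identifier: str, pause_ms: int = 0) -> str:
    """Render an identifier character by character as an SSML string.

    With pause_ms == 0 the characters are separated by commas only;
    otherwise an SSML <break> of that duration is placed between them.
    """
    # Escape characters such as "&" so the result stays well-formed XML.
    chars = [escape(c) for c in identifier if not c.isspace()]
    if pause_ms > 0:
        separator = f' <break time="{pause_ms}ms"/> '
    else:
        separator = ", "
    return "<speak>" + separator.join(chars) + "</speak>"
```

For example, `spell_out("John")` yields `<speak>J, o, h, n</speak>`, whereas the non-character-by-character rendering would simply pass "John" through unchanged.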
- the automated telephone calls described herein can be conducted by the automated assistant 115.
- the automated telephone calls can be conducted using Voice over Internet Protocol (VoIP), public switched telephone networks (PSTN), and/or other telephonic communication protocols.
- the automated telephone calls described herein are automated in that the automated assistant 115 conducts the automated telephone calls using one or more of the components depicted in FIG. 1, on behalf of a user of the client device 110, and the user of the client device 110 is not an active participant in the automated telephone call(s).
- the TTS engine 134 can process, using TTS model(s) stored in the ML models database 130A, textual content (e.g., text formulated by the automated assistant 115) to generate synthesized speech audio data that includes computer-generated synthesized speech.
- the LLM engine 135 can replace one or more of the aforementioned components. For instance, the LLM engine 135 can replace the NLU engine 132 and/or the fulfillment engine 133.
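The substitution described above can be pictured as a single conversational turn in which either an NLU step followed by a fulfillment step, or one LLM step, sits between speech recognition and speech synthesis. The sketch below is only a schematic of that wiring: every callable is a stub supplied by the caller, and the function name `build_turn_handler` and its signature are hypothetical.

```python
from typing import Any, Callable, Optional


def build_turn_handler(
    asr: Callable[[bytes], str],
    tts: Callable[[str], bytes],
    nlu: Optional[Callable[[str], Any]] = None,
    fulfill: Optional[Callable[[Any], str]] = None,
    llm: Optional[Callable[[str], str]] = None,
) -> Callable[[bytes], bytes]:
    """Wire one turn: audio in -> recognized text -> reply text -> audio out."""
    if llm is None and (nlu is None or fulfill is None):
        raise ValueError("provide either llm, or both nlu and fulfill")

    def handle(audio: bytes) -> bytes:
        text = asr(audio)
        if llm is not None:
            reply = llm(text)            # LLM stands in for NLU + fulfillment
        else:
            reply = fulfill(nlu(text))   # classic two-stage path
        return tts(reply)

    return handle
```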
- the system determines a result of the automated telephone call with the entity.
- the system can determine the result of the automated telephone call based on one or more corresponding instances of audio data that are received from a representative associated with the entity.
- the result of the automated telephone call can be, for example, an indication of whether the task was successfully performed, details associated with performance of the task, and/or any other result of the automated telephone call. It should be noted that the result of the automated telephone call may vary depending on the task being performed during the automated telephone call.
- the system can cause the unique personal identifier engine 163 to interact with the unique personal identifiers database 163A to determine whether the frequency of the unique personal identifier satisfies a frequency threshold.
- the system can determine to generate the synthesized speech audio data that includes the unique personal identifier on the character-by-character basis in response to determining that the frequency of the unique personal identifier fails to satisfy the frequency threshold.
- the system can determine to generate the synthesized speech audio data that includes the unique personal identifier on the non-character-by-character basis in response to determining that the frequency of the unique personal identifier satisfies the frequency threshold.
- If the unique personal identifier does not include characters beyond a threshold length, then the system may determine to generate the synthesized speech audio data that includes the unique personal identifier on the non-character-by-character basis, but if the unique personal identifier does include characters beyond the threshold length, then the system may determine to generate the synthesized speech audio data that includes the unique personal identifier on the character-by-character basis.
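The frequency and length checks above can be combined into one small decision function. This is a minimal sketch under stated assumptions: the lexicon is modeled as a plain dictionary of lowercase identifiers to counts, and the threshold values (`min_count`, `max_length`) are arbitrary placeholders; the source names the thresholds but gives no values.

```python
from typing import Mapping


def should_spell(identifier: str,
                 lexicon_counts: Mapping[str, int],
                 min_count: int = 100,
                 max_length: int = 8) -> bool:
    """Return True to render the identifier character by character."""
    # Frequency threshold: common identifiers can be spoken as whole words.
    frequent = lexicon_counts.get(identifier.lower(), 0) >= min_count
    # Length threshold: long identifiers are spelled even if they are common.
    too_long = len(identifier) > max_length
    return (not frequent) or too_long
```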
- the system can determine whether to generate the synthesized speech that injects the one or more pauses based on the same or similar criteria, but by leveraging the pause engine 164. For example, the system can cause the pause engine 164 to consider the frequency of the unique personal identifier, the length of the unique personal identifier, the complexity of the unique personal identifier, and/or other criteria in determining whether to inject the one or more pauses into the synthesized speech that includes the unique personal identifier.
- the frequency of the unique personal identifier failing to satisfy the frequency threshold may result in the one or more pauses being injected into the synthesized speech
- the length of the unique personal identifier satisfying the length threshold may result in the one or more pauses being injected into the synthesized speech
- the complexity of the unique personal identifier satisfying the complexity threshold may result in the one or more pauses being injected into the synthesized speech.
- the system can cause the pause engine 164 to consider the type of representative associated with the entity that the automated assistant is interacting with in determining whether to inject the one or more pauses into the synthesized speech that includes the unique personal identifier.
- the automated assistant interacting with the human representative may result in the one or more pauses being injected into the synthesized speech (e.g., since the human is likely to record and/or otherwise act upon the unique personal identifier). Accordingly, it should be understood that not only can these particular criteria influence whether the unique personal identifier is rendered on the character-by-character basis, but can also influence whether the one or more pauses are injected into the synthesized speech and where the one or more pauses are injected into the synthesized speech.
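The pause criteria can be sketched in the same spirit. The policy below (no pauses for non-human representatives; pauses for a human when the identifier is infrequent, long, or complex) is only one possible reading of the criteria. The `Representative` enum, the definition of "complex" as mixing letters and digits, and the length threshold are assumptions for illustration.

```python
from enum import Enum


class Representative(Enum):
    HUMAN = "human"
    VOICE_BOT = "voice_bot"
    IVR = "ivr"


def is_complex(identifier: str) -> bool:
    # Assumed notion of complexity: the identifier mixes letters and digits.
    return (any(c.isdigit() for c in identifier)
            and any(c.isalpha() for c in identifier))


def should_inject_pauses(identifier: str,
                         representative: Representative,
                         is_frequent: bool,
                         max_length: int = 8) -> bool:
    # Assumption: ASR-backed representatives do not need pauses.
    if representative is not Representative.HUMAN:
        return False
    # A human is likely writing the identifier down, so pause when it is hard.
    return (not is_frequent) or len(identifier) > max_length or is_complex(identifier)
```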
- the system determines to generate synthesized speech that includes the unique personal identifier on a character-by-character basis, then the system proceeds to block 460.
- the system processes, using a text-to-speech (TTS) model, the textual content to generate the synthesized speech that includes the unique personal identifier on the character-by-character basis and optionally with the one or more pauses.
- the system determines to generate synthesized speech that includes the unique personal identifier on a non-character-by-character basis, then the system proceeds to block 462.
- the system processes, using a text-to-speech (TTS) model, the textual content to generate the synthesized speech that includes the unique personal identifier on the non-character-by-character basis and optionally with the one or more pauses.
- the system causes the synthesized speech to be audibly rendered for presentation to the representative associated with the entity.
- the system can cause the synthesized speech to be audibly rendered via speaker(s) associated with a client device of the representative associated with the entity over one or more networks (e.g., PSTN, VoIP, etc.).
- the system returns to block 456 and continues with the method 400.
- multiple iterations of the method 400 can be performed in a parallel manner and/or a serial manner for different unique personal identifiers that are to be rendered during the automated telephone call.
- Although the method 400 of FIG. 4 is not described with respect to selecting an initial voice and dynamically switching to an alternative voice utilized by an automated assistant during the automated telephone call (e.g., as described with respect to FIGS. 2 and 3), it should be understood that this is for the sake of example and is not meant to be limiting. Rather, it should be understood that the method 400 of FIG. 4 is described herein to illustrate some techniques contemplated herein.
- FIGS. 5A and 5B each depict a client device 110 (e.g., an instance of the client device 110 from FIG. 1) having a display 180.
- One or more aspects of an automated assistant associated with the client device 110 may be implemented locally on the client device 110 and/or on other client device(s) that are in network communication with the client device 110 in a distributed manner (e.g., via network(s) 199 of FIG. 1).
- FIGS. 5A and 5B are described herein as being performed by the automated assistant 115.
- Although the client device 110 of FIGS. 5A and 5B is depicted as a mobile phone, it should be understood that this is not meant to be limiting.
- the client device 110 can be, for example, a stand-alone assistant device (e.g., with speaker(s) and/or a display), a laptop, a desktop computer, a wearable computing device (e.g., a smart watch, smart headphones, etc.), a vehicular computing device, and/or any other client device capable of making telephonic calls.
- the display 180 of the client device 110 in FIGS. 5A and 5B further includes a textual input interface element 184 that the user may select to generate user input via a keyboard (virtual or real) or other touch and/or typed input, and a spoken input interface element 185 that the user may select to generate user input via microphone(s) of the client device 110.
- the user may generate user input via the microphone(s) without selection of the spoken input interface element 185.
- active monitoring for audible user input via the microphone(s) may occur to obviate the need for the user to select the spoken input interface element 185.
- the spoken input interface element 185 may be omitted.
- the textual input interface element 184 may additionally and/or alternatively be omitted (e.g., the user may only provide audible user input).
- the display 180 of the client device 110 in FIGS. 5A and 5B also includes system interface elements 181, 182, 183 that may be interacted with by the user to cause the client device 110 to perform one or more actions.
- a user of the client device 110 directs user input of "Call Example Italian Restaurant to see if they have gabagool and make me a reservation for tonight at 8:00 PM for two people if they do".
- the task to be performed can be considered: (1) call “Example Italian Restaurant”; (2) inquire about availability of the gabagool at "Example Italian Restaurant”; and (3) make reservation for tonight at 8:00 PM for two people if "Example Italian Restaurant” has the gabagool.
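The three-part task above has a conditional structure: the reservation is only attempted if the restaurant confirms that it has the gabagool. One way to represent that is a list of steps with preconditions, as in the sketch below. The `Step` and `Task` types, the fact names such as `has_gabagool`, and the function `runnable_steps` are hypothetical and serve only to illustrate the dependency between the steps.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class Step:
    description: str
    requires: FrozenSet[str] = frozenset()  # facts that must be confirmed first


@dataclass(frozen=True)
class Task:
    entity: str
    steps: Tuple[Step, ...]


def runnable_steps(task: Task, confirmed: FrozenSet[str]) -> List[str]:
    """Return descriptions of all steps whose preconditions are confirmed."""
    return [s.description for s in task.steps if s.requires <= confirmed]


TASK = Task(
    entity="Example Italian Restaurant",
    steps=(
        Step("call the restaurant"),
        Step("ask whether gabagool is available", frozenset({"call_connected"})),
        Step("reserve tonight at 8:00 PM for two",
             frozenset({"call_connected", "has_gabagool"})),
    ),
)
```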
- the automated assistant can identify the entity as indicated by 552A1 (e.g., "Example Italian Restaurant") and based on the user input.
- the automated assistant can initially select a Southeastern US voice as indicated by 552A2 anticipating that an employee of "Example Italian Restaurant” will have a Southeastern US accent and/or vocabulary. Moreover, the automated assistant can initiate the automated telephone call with "Example Italian Restaurant” as indicated by 552A3 to perform the task.
- Example Italian Restaurant has a voice bot that utilizes an Italian US voice as indicated by 554A1 and the voice bot plays a greeting 554A2 of "Ciao, thank you for calling Example Italian Restaurant, please tell me how I can be of assistance today.” Accordingly, and in response to analyzing the greeting 554A2, the automated assistant can determine that the voice bot associated with "Example Italian Restaurant” utilizes an Italian US voice. In this example, and prior to causing any synthesized speech to be rendered, the automated assistant can determine to switch to an Italian US voice as indicated by 556A1 and then cause synthesized speech 556A2 of "Ahh Ciao, I was wondering if you all have the gabagool?".
- the voice bot plays a response 558A1 of "We are Example Italian Restaurant, of course we have the gabagool". Accordingly, and in response to analyzing the response 558A1, the automated assistant can then cause synthesized speech 560A1 of "In that case, please transfer me to the hostess to make a reservation."
- the automated assistant can select the initial voice (e.g., the Southeastern US voice as indicated by 552A2) which it anticipates will result in the task being performed in a quick and efficient manner, such as the Southeastern US voice based on the user and "Example Italian Restaurant" being located in the Southeastern US.
- the automated assistant can dynamically adapt to the alternative voice (e.g., the Italian US voice as indicated by 556A1) after hearing the greeting 554A2 provided by the voice bot associated with "Example Italian Restaurant”.
- the voice bot associated with "Example Italian Restaurant” transfers the call to a human hostess associated with "Example Italian Restaurant".
- the human hostess has a Southeastern US voice as indicated by 562A1 and provides an additional greeting 562A2 of "Hey there, thanks for calling Example Italian Restaurant, what day and time would you like to make the reservation?"
- the automated assistant can determine that the human hostess associated with "Example Italian Restaurant” utilizes a Southeastern US voice.
- synthesized speech has already been rendered as part of the automated telephone call (e.g., the synthesized speech 556A2 and 560A1 from FIG.
- the automated assistant can determine to switch back to the Southeastern US voice as indicated by 564A1 and then cause synthesized speech 564A2 of "Hello, do you have any availability for tonight at 8:00 PM for two people? The name for the reservation is Todd" (e.g., where the user's surname is "Todd").
- the human hostess provides an additional response 568A1 of "We certainly do, see you tonight!" to indicate that the reservation was successfully made on behalf of the user.
- the automated assistant can switch back from the alternative voice (e.g., the Italian US voice as indicated by 556A1) which it anticipates will result in the task being performed in a quick and efficient manner, such as the Italian US voice based on analyzing the greeting 554A2.
- the automated assistant can dynamically adapt back to the initial voice (e.g., the Southeastern US voice as indicated by 564A1) after hearing the additional greeting 562A2 provided by the human hostess associated with "Example Italian Restaurant".
- Although the example of FIGS. 5A and 5B is described with respect to selecting the initial voice, switching to the alternative voice, and then switching back to the initial voice throughout a duration of the automated telephone call, it should be understood that this is for the sake of example and is not meant to be limiting. Rather, it should be understood that the example of FIGS. 5A and 5B is provided to illustrate various techniques contemplated herein (e.g., as described with respect to FIGS. 2 and 3). Further, although the example of FIGS. 5A and 5B is described with respect to different accents and/or vocabularies corresponding to the different voices based on geographical regions, nationalities, etc., it should be understood that this is for the sake of example and is not meant to be limiting.
- the alternative voice can have the same accent and/or vocab as the initial voice, but different prosodic properties may be utilized to change the rhythm, tempo, etc. of the synthesized speech to speed up, slow down, alter emphasis, etc. of the synthesized speech.
- Although the example of FIGS. 5A and 5B is not described with respect to determining how to render unique personal identifiers (e.g., the user's surname of "Todd"), it should be understood that this is for the sake of example and is not meant to be limiting.
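- The voice switching walked through for FIGS. 5A and 5B can be sketched as a small piece of state that follows the voice detected on the other end of the call. This is a minimal sketch under assumptions: the class `CallVoiceState`, the `observe` method, and the voice labels are hypothetical names, and the detected voice is assumed to come from some upstream analysis of the call audio that is not shown here.

```python
from dataclasses import dataclass, field

# Hypothetical voice labels; the description only names a "Southeastern US"
# voice and an "Italian US" voice.
SOUTHEASTERN_US = "southeastern_us"
ITALIAN_US = "italian_us"


@dataclass
class CallVoiceState:
    """Tracks which voice the assistant uses during one automated call."""

    current_voice: str
    history: list = field(default_factory=list)

    def __post_init__(self):
        # Record the initial voice as the first entry of the history.
        self.history.append(self.current_voice)

    def observe(self, detected_voice):
        """Switch voices when the other party's detected voice differs.

        `detected_voice` is an assumed input; None means that no voice
        was detected with confidence, so the current voice is kept.
        """
        if detected_voice is not None and detected_voice != self.current_voice:
            self.current_voice = detected_voice
            self.history.append(detected_voice)
        return self.current_voice


state = CallVoiceState(current_voice=SOUTHEASTERN_US)  # initial voice (552A2)
state.observe(ITALIAN_US)       # voice bot greeting 554A2: switch (556A1)
state.observe(ITALIAN_US)       # voice bot response 558A1: no change
state.observe(SOUTHEASTERN_US)  # hostess greeting 562A2: switch back (564A1)
```

- In this sketch the assistant simply mirrors the detected voice; an actual implementation could weigh other signals before switching.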
- FIGS. 6A, 6B, 6C, and 6D each depict a client device 110 (e.g., an instance of the client device 110 from FIG. 1) having a display 180.
- an automated assistant associated with the client device 110 (e.g., an instance of the automated assistant 115 from FIG. 1) may be implemented locally on the client device 110 and/or on other client device(s) that are in network communication with the client device 110 in a distributed manner (e.g., via network(s) 199 of FIG. 1).
- Although the client device 110 of FIGS. 6A, 6B, 6C, and 6D is depicted as a mobile phone, it should be understood that this is not meant to be limiting.
- the client device 110 can be, for example, a stand-alone assistant device (e.g., with speaker(s) and/or a display), a laptop, a desktop computer, a wearable computing device (e.g., a smart watch, smart headphones, etc.), a vehicular computing device, and/or any other client device capable of making telephonic calls.
- the display 180 of the client device 110 in FIGS. 6A, 6B, 6C, and 6D further includes a textual input interface element 184 that the user may select to generate user input via a keyboard (virtual or real) or other touch and/or typed input, and a spoken input interface element 185 that the user may select to generate user input via microphone(s) of the client device 110.
- the user may generate user input via the microphone(s) without selection of the spoken input interface element 185.
- active monitoring for audible user input via the microphone(s) may occur to obviate the need for the user to select the spoken input interface element 185.
- the spoken input interface element 185 may be omitted.
- the textual input interface element 184 may additionally and/or alternatively be omitted (e.g., the user may only provide audible user input).
- the display 180 of the client device 110 in FIGS. 6A, 6B, 6C, and 6D also includes system interface elements 181, 182, 183 that may be interacted with by the user to cause the client device 110 to perform one or more actions.
- Referring specifically to FIG. 6A, for the sake of example assume that a user of the client device 110 directs user input of "Call Example Italian Restaurant and make me a reservation for tonight at 8:00 PM for two people".
- the task to be performed can be considered: (1) call "Example Italian Restaurant”; and (2) make reservation for tonight at 8:00 PM for two people.
- the automated assistant can identify the entity as indicated by 652A1 (e.g., "Example Italian Restaurant") and based on the user input.
- the automated assistant can initiate the automated telephone call with "Example Italian Restaurant” as indicated by 652A2 to perform the task.
- Example Italian Restaurant has a human hostess that answers the automated telephone call and provides a greeting 654A1 of "Ciao, thank you for calling Example Italian Restaurant, please tell me how I can be of assistance today." Accordingly, and in response to analyzing the greeting 654A1, the automated assistant can determine that the human hostess is, in fact, a human. Nonetheless, further assume that the automated assistant causes synthesized speech 656A1 of "Hello, I would like to make a reservation for tonight at 8:00 PM for two people, do you have any availability? The name for the reservation is Todd" (e.g., where the user's surname is "Todd"). Further assume that the human hostess provides a response 658A1 of "Your reservation is set, see you tonight!" to indicate that the reservation was successfully made on behalf of the user.
- the user's surname of "Todd” can be considered a unique personal identifier for the user of the client device 110.
- the unique personal identifier in the synthesized speech 656A1 is generated on a non-character-by-character basis and does not include any pauses.
- the automated assistant can determine to generate the synthesized speech 656A1 on the non-character-by-character basis based on, for example, the name "Todd” being a relatively frequent or common name in a lexicon of users in the US, the name “Todd” being of a relatively short length, the name “Todd” being relatively uncomplex, and/or based on other criteria (e.g., as described with respect to FIG. 4).
- the synthesized speech 656A1 can be rendered on the non-character-by-character basis and without any pauses to conclude the interaction in a more quick and efficient manner.
- the automated assistant may determine to generate the synthesized speech 656B1 on a character-by-character basis.
- the automated assistant can determine to generate the synthesized speech 656B1 on the character-by-character basis based on, for example, the name "Carlsen” being a relatively infrequent or uncommon name in a lexicon of users in the US (e.g., compared to users in, for example, Scandinavian countries), the name “Carlsen” being of a relatively longer length, the name “Carlsen” being complex (e.g., whether "Carlsen” ends with "-sen” or "-son”), and/or based on other criteria (e.g., as described with respect to FIG. 4).
- the synthesized speech 656B1 can be rendered on the character-by-character basis and with pauses to conclude the interaction in a more quick and efficient manner by mitigating and/or eliminating instances in which the automated assistant may be asked to repeat the unique personal identifier and/or a character of the unique personal identifier.
- the automated assistant may determine not only to generate the synthesized speech 656C1 on a character-by-character basis, but also to inject one or more pauses into the synthesized speech 656C1.
- the automated assistant can determine to generate the synthesized speech 656C1 on the character-by-character basis based on, for example, the name "Carlsen" being a relatively infrequent or uncommon name in a lexicon of users in the US (e.g., compared to users in, for example, Scandinavian countries), the name "Carlsen" being of a relatively longer length, the name "Carlsen" being complex (e.g., whether "Carlsen" ends with "-sen" or "-son"), and/or based on other criteria (e.g., as described with respect to FIG. 4).
- the automated assistant can determine to generate the synthesized speech 656C1 with the one or more pauses based on, for example, the name "Carlsen” being a relatively infrequent or uncommon name in a lexicon of users in the US (e.g., compared to users in, for example, Scandinavian countries), the name “Carlsen” being of a relatively longer length, the name “Carlsen” being complex (e.g., whether "Carlsen” ends with “-sen” or “-son”), and/or based on other criteria (e.g., as described with respect to FIG. 4).
- the automated assistant may determine to generate synthesized speech including the unique personal identifier of "Carlsen" on a non-character-by-character basis and without injecting any pauses in certain scenarios.
- the synthesized speech 656C1 can be rendered on the character-by-character basis and with pauses to conclude the interaction in a more quick and efficient manner by mitigating and/or eliminating instances in which the automated assistant may be asked to repeat the unique personal identifier and/or a character of the unique personal identifier.
- the user's surname is "Carlsen” instead of "Todd”.
- a voice bot associated with "Example Italian Restaurant” answers the automated telephone call.
- the voice bot provides a greeting 654D1 of "Ciao, thank you for calling Example Italian Restaurant, please tell me how I can be of assistance today.” Accordingly, and in response to analyzing the greeting 654D1, the automated assistant can determine that the voice bot is, in fact, not a human. Nonetheless, further assume that the automated assistant causes synthesized speech 656D1 of "Hello, I would like to make a reservation for tonight at 8:00 PM for two people, do you have any availability? The name for the reservation is Carlsen”. Further assume that the voice bot provides a response 658D1 of "Your reservation is set, see you tonight! to indicate that the reservation was successfully made on behalf of the user.
- the unique personal identifier in the synthesized speech 656D1 is generated on a non-character-by-character basis and does not include any pauses.
- the automated assistant can determine to generate the synthesized speech 656D1 on the non-character-by-character basis based on, for example, the representative associated with the entity being a voice bot that utilizes ASR model(s) in processing the synthesized speech 656D1, such that the ASR model(s) are likely to correctly interpret the unique personal identifier and that the pauses are more likely to cause confusion than to make it easier for the voice bot to interpret the unique personal identifier.
- the synthesized speech 656D1 can be rendered on the non-character-by-character basis and without any pauses to conclude the interaction in a more quick and efficient manner.
- Although FIGS. 6A-6D are described with respect to certain examples, it should be understood that those examples are described herein to illustrate various techniques contemplated herein and are not meant to be limiting. Rather, it should be understood that the techniques described herein can be adapted to different tasks that the user requests the automated assistant to perform, based on audio data that captures inputs of a representative associated with an entity, and so on.
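- The four scenarios of FIGS. 6A-6D can be summarized as one decision that takes the kind of representative and the properties of the unique personal identifier as inputs. The following is a minimal sketch and not the method of the source: the function `plan_identifier_rendering`, the `RenderPlan` type, the `relative_frequency` input, and both threshold values are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RenderPlan:
    character_by_character: bool
    inject_pauses: bool


def plan_identifier_rendering(identifier, representative_is_human,
                              relative_frequency,
                              frequency_threshold=0.001,
                              length_threshold=5):
    """Decide how to speak a unique personal identifier.

    The thresholds and `relative_frequency` (how common the identifier
    is in some lexicon) are assumptions made for illustration only.
    """
    # A voice bot is assumed to run ASR that handles fluent speech well,
    # so the identifier is spoken normally and without pauses.
    if not representative_is_human:
        return RenderPlan(False, False)

    uncommon = relative_frequency < frequency_threshold
    long_name = len(identifier) > length_threshold

    # Spell the identifier out when it is uncommon or long.
    spell_out = uncommon or long_name
    # In this sketch, pauses are added only when it is both uncommon and long.
    return RenderPlan(spell_out, uncommon and long_name)


todd_human = plan_identifier_rendering("Todd", True, 0.004)
carlsen_human = plan_identifier_rendering("Carlsen", True, 0.0001)
carlsen_bot = plan_identifier_rendering("Carlsen", False, 0.0001)
```

- With these assumed inputs, "Todd" spoken to a human is rendered normally, "Carlsen" spoken to a human is spelled out with pauses, and "Carlsen" spoken to a voice bot is rendered normally.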
- FIG. 7 is a block diagram of an example computing device 710 that may optionally be utilized to perform one or more aspects of techniques described herein.
- a client device, remote system component(s), and/or other component(s) may comprise one or more components of the example computing device 710.
- Computing device 710 typically includes at least one processor 714 which communicates with a number of peripheral devices via bus subsystem 712. These peripheral devices may include a storage subsystem 724, including, for example, a memory subsystem 725 and a file storage subsystem 726, user interface output devices 720, user interface input devices 722, and a network interface subsystem 716. The input and output devices allow user interaction with computing device 710.
- Network interface subsystem 716 provides an interface to outside networks and is coupled to corresponding interface devices in other computing devices.
- User interface output devices 720 may include a display subsystem, a printer, a fax machine, or non-visual displays such as audio output devices.
- the display subsystem may include a cathode ray tube (CRT), a flat-panel device such as a liquid crystal display (LCD), a projection device, or some other mechanism for creating a visible image.
- the display subsystem may also provide non-visual display such as via audio output devices.
- output device is intended to include all possible types of devices and ways to output information from computing device 710 to the user or to another machine or computing device.
- Memory 725 used in the storage subsystem 724 can include a number of memories including a main random-access memory (RAM) 730 for storage of instructions and data during program execution and a read only memory (ROM) 732 in which fixed instructions are stored.
- a file storage subsystem 726 can provide persistent storage for program and data files, and may include a hard disk drive, a floppy disk drive along with associated removable media, a CD-ROM drive, an optical drive, or removable media cartridges.
- the modules implementing the functionality of certain implementations may be stored by file storage subsystem 726 in the storage subsystem 724, or in other machines accessible by the processor(s) 714.
- Computing device 710 can be of varying types including a workstation, server, computing cluster, blade server, server farm, or any other data processing system or computing device. Due to the ever-changing nature of computers and networks, the description of computing device 710 depicted in FIG. 7 is intended only as a specific example for purposes of illustrating some implementations. Many other configurations of computing device 710 are possible having more or fewer components than the computing device depicted in FIG. 7.
- the users may be provided with an opportunity to control whether programs or features collect user information (e.g., information about a user's social network, social actions or activities, profession, a user's preferences, or a user's current geographic location), or to control whether and/or how to receive content from the content server that may be more relevant to the user.
- certain data may be treated in one or more ways before it is stored or used, so that personal identifiable information is removed.
- a method implemented by one or more processors includes identifying an entity for an automated assistant to engage with during an automated telephone call; selecting an initial voice to be utilized by the automated assistant and during the automated telephone call with the entity, the initial voice to be utilized by the automated assistant in generating one or more corresponding instances of synthesized speech to be rendered during the automated telephone call with the entity; initiating the automated telephone call with the entity; and during the automated telephone call with the entity: determining whether to select an alternative voice to be utilized, and in lieu of the initial voice, by the automated assistant and during the automated telephone call with the entity; and in response to determining to select the alternative voice to be utilized by the automated assistant and during the automated telephone call with the entity: selecting the alternative voice to be utilized by the automated assistant and during the automated telephone call with the entity, the alternative voice to be utilized by the automated assistant in generating the one or more corresponding instances of synthesized speech to be rendered during the automated telephone call with the entity; and causing the automated assistant to utilize the alternative voice in continuing with the automated telephone call
- the initial voice may be associated with a first set of prosodic properties
- the alternative voice may be associated with a second set of prosodic properties
- the second set of prosodic properties may differ from the first set of prosodic properties
- the method may further include, when the initial voice is being utilized by the automated assistant and during the automated telephone call with the entity: processing, using a text-to-speech (TTS) model, textual content to be provided for presentation to a representative associated with the entity and the first set of prosodic properties to generate one or more of the corresponding instances of synthesized speech audio data.
- the method may further include, when the alternative voice is being utilized by the automated assistant and during the automated telephone call with the entity: processing, using the TTS model, the textual content to be provided for presentation to the representative associated with the entity and the second set of prosodic properties to generate one or more of the corresponding instances of synthesized speech audio data.
- the method may further include, when the initial voice is being utilized by the automated assistant and during the automated telephone call with the entity: processing, using the first TTS model, textual content to be provided for presentation to a representative associated with the entity to generate one or more of the corresponding instances of synthesized speech audio data.
- the method may further include, when the alternative voice is being utilized by the automated assistant and during the automated telephone call with the entity: processing, using the second TTS model, the textual content to be provided for presentation to the representative associated with the entity to generate one or more of the corresponding instances of synthesized speech audio data.
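- The two variants above (one TTS model conditioned on a set of prosodic properties, or a separate TTS model per voice) can be contrasted in a short sketch. Everything here is an assumption made for illustration: `fake_tts` is a stand-in that returns a descriptive string instead of audio, and the `ProsodicProperties` fields and their values are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class ProsodicProperties:
    rate: float = 1.0          # relative tempo of the speech
    pitch_shift: float = 0.0   # relative pitch offset
    emphasis: str = "moderate"


def fake_tts(text: str, prosody: ProsodicProperties) -> str:
    """Stand-in for a TTS model: describes the audio it would produce."""
    return (f"[rate={prosody.rate} pitch={prosody.pitch_shift} "
            f"emphasis={prosody.emphasis}] {text}")


INITIAL = ProsodicProperties()
ALTERNATIVE = ProsodicProperties(rate=0.85, emphasis="strong")

# Variant 1: a single TTS model; each voice is a set of prosodic properties.
VOICE_PROSODY: Dict[str, ProsodicProperties] = {
    "initial": INITIAL,
    "alternative": ALTERNATIVE,
}


def synthesize_shared_model(text: str, voice: str) -> str:
    return fake_tts(text, VOICE_PROSODY[voice])


# Variant 2: a separate TTS model per voice (simulated with closures).
VOICE_MODELS: Dict[str, Callable[[str], str]] = {
    "initial": lambda text: fake_tts(text, INITIAL),
    "alternative": lambda text: fake_tts(text, ALTERNATIVE),
}


def synthesize_per_voice_model(text: str, voice: str) -> str:
    return VOICE_MODELS[voice](text)
```

- In both variants the same textual content is processed; only the way the voice is selected differs.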
- selecting the initial voice to be utilized by the automated assistant and during the automated telephone call with the entity may be based on one or more of: a type of the entity, a particular location associated with the entity, or whether a phone number associated with the entity is a landline or non-landline.
- determining whether to select the alternative voice to be utilized, and in lieu of the initial voice, by the automated assistant and during the automated telephone call with the entity may be based on analyzing content received upon initiating the automated telephone call with the entity.
- the content received upon initiating the automated telephone call with the entity may include audio data from a representative that is associated with the entity or an interactive voice response (IVR) system that is associated with the entity.
- determining whether to select the alternative voice to be utilized, and in lieu of the initial voice, by the automated assistant and during the automated telephone call with the entity may be prior to any of the one or more corresponding instances of synthesized speech audio data being rendered.
- determining whether to select the alternative voice to be utilized, and in lieu of the initial voice, by the automated assistant and during the automated telephone call with the entity may be subsequent to one or more of the corresponding instances of synthesized speech audio data being rendered.
- the automated assistant may be executed remotely from the client device of the user.
- identifying the entity for the automated assistant to engage with during the automated telephone call may be based on a spike in query activity across a population of client devices in a certain geographical area, and the automated assistant may initiate and conduct the automated telephone call on behalf of the population of client devices.
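- The source does not say how a spike in query activity is detected. One common approach, shown here purely as an assumed illustration, is to compare the current query count for an entity against the mean and standard deviation of its recent counts. The function name `is_query_spike` and the `min_sigma` parameter are hypothetical.

```python
from statistics import mean, pstdev


def is_query_spike(history, current, min_sigma=3.0):
    """Return True when `current` is at least `min_sigma` standard
    deviations above the mean of the historical counts."""
    if len(history) < 2:
        # Too little history to estimate a baseline.
        return False
    baseline = mean(history)
    spread = pstdev(history)
    if spread == 0:
        # Flat history: any increase counts as a spike in this sketch.
        return current > baseline
    return (current - baseline) / spread >= min_sigma


# Hypothetical hourly query counts for one entity in one geographic area.
recent_counts = [10, 12, 11, 9, 10]
spike_now = is_query_spike(recent_counts, 40)
normal_now = is_query_spike(recent_counts, 12)
```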
- the method may further include, subsequent to the automated assistant completing the automated telephone call: updating, based on a result of the automated telephone call, one or more databases.
- initiating the automated telephone call with the entity may include: obtaining a telephone number associated with the entity; and initiating, using the telephone number associated with the entity, the automated telephone call.
- a method implemented by one or more processors includes identifying an entity for an automated assistant to engage with during an automated telephone call; initiating the automated telephone call with the entity; and during the automated telephone call with the entity: identifying textual content to be provided for presentation to a representative associated with the entity, the textual content including a unique personal identifier; determining, based on the representative associated with the entity and/or based on the unique personal identifier, whether to generate synthesized speech audio data that includes the unique personal identifier on a character-by-character basis or the unique personal identifier on a non-character-by-character basis; and in response to determining to generate the synthesized speech audio data that includes the unique personal identifier on the character-by-character basis: processing, using a text-to-speech (TTS) model, the textual content to generate the synthesized speech audio data that includes the unique personal identifier on the character-by-character basis; and causing the synthesized speech audio data to be audibly rendered for presentation to the representative associated with the entity.
- determining to generate the synthesized speech audio data that includes the unique personal identifier on the non-character-by-character basis may be in response to determining that the representative associated with the entity is an automated assistant representative.
- determining whether to generate the synthesized speech audio data that includes the unique personal identifier on the character-by-character basis or the unique personal identifier on the non-character-by-character basis may be based on both the representative associated with the entity and the unique personal identifier.
- the unique personal identifier may be one or more of: a name, an email address, a physical address, a username, a password, a name of an entity, or a domain name.
- the method may further include, in response to determining to generate the synthesized speech audio data that includes the unique personal identifier on the non-character-by-character basis: processing, using a text-to-speech (TTS) model, the textual content to generate the synthesized speech audio data that includes the unique personal identifier on the non-character-by-character basis; and causing the synthesized speech audio data to be audibly rendered for presentation to the representative associated with the entity.
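- One simple way to obtain a character-by-character rendering is to rewrite the textual content before it reaches the TTS model, so that the identifier is replaced by its individual characters. This is a sketch under assumptions: the function names and the comma-separated output format are hypothetical, and the source does not state that the textual content is rewritten in this way.

```python
def spell_out(identifier):
    """Return the identifier as comma-separated upper-case characters."""
    return ", ".join(ch.upper() for ch in identifier if not ch.isspace())


def render_identifier(text, identifier, character_by_character):
    """Rewrite `text` so the identifier is spelled out when requested.

    Only the first occurrence is replaced; the text is returned unchanged
    for the non-character-by-character case.
    """
    if not character_by_character or identifier not in text:
        return text
    return text.replace(identifier, spell_out(identifier), 1)


sentence = "The name for the reservation is Carlsen"
spelled = render_identifier(sentence, "Carlsen", True)
fluent = render_identifier(sentence, "Carlsen", False)
```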
- a method implemented by one or more processors includes identifying an entity for an automated assistant to engage with during an automated telephone call; initiating the automated telephone call with the entity; and during the automated telephone call with the entity: identifying textual content to be provided for presentation to a representative associated with the entity, the textual content including a unique personal identifier; determining, based on the representative associated with the entity and/or based on the unique personal identifier, whether to inject one or more pauses into synthesized speech audio data that includes the unique personal identifier; and in response to determining to inject the one or more pauses into the synthesized speech audio data that includes the unique personal identifier: processing, using a text-to-speech (TTS) model, the textual content to generate the synthesized speech audio data that includes the unique personal identifier and the one or more pauses; and causing the synthesized speech audio data to be audibly rendered for presentation to the representative associated with the entity.
- determining whether to inject the one or more pauses into synthesized speech audio data that includes the unique personal identifier may be based on the representative associated with the entity.
- determining to inject the one or more pauses into the synthesized speech audio data that includes the unique personal identifier may be in response to determining that the representative associated with the entity is a human representative.
- determining to not inject the one or more pauses into the synthesized speech audio data that includes the unique personal identifier may be in response to determining that the representative associated with the entity is an automated assistant representative.
- determining whether to inject the one or more pauses into synthesized speech audio data that includes the unique personal identifier may be based on the unique personal identifier, and determining whether to inject the one or more pauses into synthesized speech audio data that includes the unique personal identifier may be based on one or more of: a frequency of the unique personal identifier, a length of the unique personal identifier, or a complexity of the unique personal identifier.
- determining to inject the one or more pauses into the synthesized speech audio data that includes the unique personal identifier may be in response to determining that the frequency of the unique personal identifier fails to satisfy a frequency threshold.
- determining to not inject the one or more pauses into the synthesized speech audio data that includes the unique personal identifier may be in response to determining that the frequency of the unique personal identifier satisfies the frequency threshold.
- determining to not inject the one or more pauses into the synthesized speech audio data that includes the unique personal identifier may be in response to determining that the length of the unique personal identifier fails to satisfy the length threshold.
- determining to inject the one or more pauses into the synthesized speech audio data that includes the unique personal identifier may be in response to determining that the complexity of the unique personal identifier satisfies a complexity threshold.
- determining to not inject the one or more pauses into the synthesized speech audio data that includes the unique personal identifier may be in response to determining that the complexity of the unique personal identifier fails to satisfy the complexity threshold.
- determining whether to inject the one or more pauses into synthesized speech audio data that includes the unique personal identifier may be based on both the representative associated with the entity and the unique personal identifier.
- the unique personal identifier may be one or more of: an email address, a physical address, a username, a password, a name of an entity, or a domain name.
- the method may further include, in response to determining to not inject the one or more pauses into the synthesized speech audio data that includes the unique personal identifier: processing, using a text-to-speech (TTS) model, the textual content to generate the synthesized speech audio data that includes the unique personal identifier and without the one or more pauses; and causing the synthesized speech audio data to be audibly rendered for presentation to the representative associated with the entity.
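- For TTS engines that accept SSML, pauses could be expressed with `<break>` elements placed between spelled-out characters. The source does not mention SSML, so this is an assumed illustration only: the function `identifier_to_ssml` and the 300 ms default are hypothetical, and support for the `say-as` value `characters` varies between engines.

```python
from xml.sax.saxutils import escape


def identifier_to_ssml(identifier, inject_pauses, pause_ms=300):
    """Build an SSML fragment that spells the identifier, with or
    without a pause between consecutive characters."""
    # Escape each character so that symbols such as "&" stay valid XML.
    chars = [escape(ch) for ch in identifier if not ch.isspace()]
    if inject_pauses:
        separator = f'<break time="{pause_ms}ms"/>'
        body = separator.join(
            f'<say-as interpret-as="characters">{ch}</say-as>' for ch in chars
        )
    else:
        joined = "".join(chars)
        body = f'<say-as interpret-as="characters">{joined}</say-as>'
    return f"<speak>{body}</speak>"


with_pauses = identifier_to_ssml("Al", inject_pauses=True)
without_pauses = identifier_to_ssml("Al", inject_pauses=False)
```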
- some implementations include one or more processors (e.g., central processing unit(s) (CPU(s)), graphics processing unit(s) (GPU(s)), and/or tensor processing unit(s) (TPU(s))) of one or more computing devices, where the one or more processors are operable to execute instructions stored in associated memory, and where the instructions are configured to cause performance of any of the aforementioned methods.
- Some implementations also include one or more non-transitory computer readable storage media storing computer instructions executable by one or more processors to perform any of the aforementioned methods.
- Some implementations also include a computer program product including instructions executable by one or more processors to perform any of the aforementioned methods.
Landscapes
- Engineering & Computer Science (AREA)
- Signal Processing (AREA)
- Human Computer Interaction (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Telephonic Communication Services (AREA)
Abstract
Implementations relate to dynamic adaptation of speech synthesis by an automated assistant during one or more automated telephone calls. In some implementations, one or more processors can select an initial voice to be utilized by the automated assistant in generating synthesized speech audio data during an automated telephone call. However, during the automated telephone call, the one or more processors can determine to select an alternative voice to be utilized by the automated assistant in generating synthesized speech audio data and continuing with the automated telephone call. In additional or alternative implementations, and during the automated telephone call, the one or more processors can determine whether to generate synthesized speech audio data that includes a unique personal identifier on a character-by-character basis or on a non-character-by-character basis. In additional or alternative implementations, and during the automated telephone call, the one or more processors can determine whether to inject one or more pauses into any synthesized speech audio data that is generated.
Applications Claiming Priority (3)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US202363615666P | 2023-12-28 | 2023-12-28 | |
| US18/402,293 US20250218423A1 (en) | 2023-12-28 | 2024-01-02 | Dynamic adaptation of speech synthesis by an automated assistant during automated telephone call(s) |
| PCT/US2024/062115 WO2025145052A1 (fr) | 2023-12-28 | 2024-12-27 | Dynamic adaptation of speech synthesis by an automated assistant during automated telephone call(s) |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| EP4599429A1 (fr) | 2025-08-13 |
Family
ID=94283398
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| EP24837819.2A Pending EP4599429A1 (fr) | 2023-12-28 | 2024-12-27 | Dynamic adaptation of speech synthesis by an automated assistant during automated telephone call(s) |
Country Status (2)
| Country | Link |
|---|---|
| EP (1) | EP4599429A1 (fr) |
| WO (1) | WO2025145052A1 (fr) |
Family Cites Families (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US10770059B2 (en) * | 2019-01-29 | 2020-09-08 | Gridspace Inc. | Conversational speech agent |
| WO2020227313A1 (fr) * | 2019-05-06 | 2020-11-12 | Google Llc | Automated calling system |
| CN115088033A (zh) * | 2020-02-10 | 2022-09-20 | Google LLC | Synthesized speech audio data generated on behalf of a human participant in a conversation |
| EP3909230B1 (fr) * | 2020-03-20 | 2024-10-23 | Google LLC | Semi-delegated calling by an automated assistant on behalf of a human participant |
2024
- 2024-12-27: EP application EP24837819.2A, publication EP4599429A1 (fr), status: active, pending
- 2024-12-27: WO application PCT/US2024/062115, publication WO2025145052A1 (fr), status: active, pending
Also Published As
| Publication number | Publication date |
|---|---|
| WO2025145052A1 (fr) | 2025-07-03 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US12518737B2 (en) | | Structured description-based chatbot development techniques |
| US12334049B2 (en) | | Unstructured description-based chatbot development techniques |
| US12125477B2 (en) | | Hot-word free pre-emption of automated assistant response presentation |
| EP4118647B1 (fr) | | Resolving unique personal identifiers during corresponding conversations between a voice bot and a human |
| US12615224B2 (en) | | Voice wrapper(s) for existing third-party text-based chatbot(s) |
| US12526247B2 (en) | | System(s) and method(s) for enabling a representative associated with an entity to modify a trained voice bot associated with the entity |
| US11568869B2 (en) | | Low latency automated identification of automated assistant function failure |
| US20250218423A1 (en) | | Dynamic adaptation of speech synthesis by an automated assistant during automated telephone call(s) |
| US20250095632A1 (en) | | Voice wrapper(s) for existing first-party text-based chatbot(s) |
| US20240205331A1 (en) | | Reducing telephone network traffic through utilization of pre-call information |
| US20250046305A1 (en) | | Voice-based chatbot policy override(s) for existing voice-based chatbot(s) |
| WO2025145052A1 (fr) | | Dynamic adaptation of speech synthesis by an automated assistant during automated telephone call(s) |
| US20250317514A1 (en) | | Determining whether and/or when to cause automated assistant(s) to initiate and conduct automated telephone call(s) |
| US20250133036A1 (en) | | Simulation of automated telephone call(s) |
| WO2025064241A1 (fr) | | Voice wrapper(s) for existing first-party text-based chatbot(s) |
| WO2025085267A1 (fr) | | Simulation of automated telephone call(s) |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | STAA | Information on the status of an EP patent application or granted EP patent | Free format text: STATUS: UNKNOWN |
| | STAA | Information on the status of an EP patent application or granted EP patent | Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE |
| | PUAI | Public reference made under Article 153(3) EPC to a published international application that has entered the European phase | Free format text: ORIGINAL CODE: 0009012 |
| | STAA | Information on the status of an EP patent application or granted EP patent | Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE |
| | 17P | Request for examination filed | Effective date: 20250508 |
| | AK | Designated contracting states | Kind code of ref document: A1; Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC ME MK MT NL NO PL PT RO RS SE SI SK SM TR |