CN116830561A - Echo reference prioritization and selection

Info

Publication number: CN116830561A
Application number: CN202280013990.5A
Authority: CN
Inventors: B·J·索斯韦尔; C·G·海因斯; D·古纳万
Original assignee: Dolby Laboratories Licensing Corp
Current assignee: Dolby Laboratories Licensing Corp
Priority date: 2021-02-09
Filing date: 2022-02-07
Publication date: 2023-09-29
Also published as: CN116830560A

Abstract

Some embodiments involve obtaining a plurality of echo references, the plurality of echo references including at least one echo reference for each of a plurality of audio devices in an audio environment, each echo reference corresponding to audio data played back by one or more loudspeakers of one of the plurality of audio devices. Some examples involve making an importance estimate for each of the plurality of echo references. Making the importance estimate may involve determining the expected contribution of each echo reference to echo mitigation by at least one echo management system of at least one audio device of the audio environment. Some embodiments involve selecting one or more selected echo references based at least in part on the importance estimate and providing the one or more selected echo references to the at least one echo management system.

Description

Echo reference prioritization and selection

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority to U.S. Provisional Application No. 63/147,573, filed on February 9, 2021, U.S. Provisional Application No. 63/201,939, filed on May 19, 2021, and European Application No. 21177382.5, filed on June 2, 2021, all of which are incorporated herein by reference in their entirety.

Technical Field

The present disclosure relates to devices, systems, and methods for implementing acoustic echo management.

Background Art

Audio devices with acoustic echo management systems have been widely deployed. An acoustic echo management system may include an acoustic echo canceller and/or an acoustic echo suppressor. Although existing devices, systems, and methods for acoustic echo management provide benefits, improved devices, systems, and methods would still be desirable.

Symbols and terminology

Throughout this disclosure, including in the claims, the terms "speaker," "loudspeaker," and "audio reproduction transducer" are used synonymously to denote any sound-emitting transducer (or set of transducers). A typical set of headphones includes two speakers. A speaker may be implemented to include multiple transducers (e.g., a woofer and a tweeter), which may be driven by a single, common speaker feed or by multiple speaker feeds. In some examples, the speaker feed(s) may undergo different processing in different circuit branches coupled to the different transducers.

Throughout this disclosure, including in the claims, the expression performing an operation "on" a signal or data (e.g., filtering, scaling, transforming, or applying gain to the signal or data) is used in a broad sense to denote performing the operation directly on the signal or data, or on a processed version of the signal or data (e.g., a version of the signal that has undergone preliminary filtering or pre-processing before the operation is performed on it).

Throughout this disclosure, including in the claims, the expression "system" is used in a broad sense to denote a device, system, or subsystem. For example, a subsystem that implements a decoder may be referred to as a decoder system, and a system that includes such a subsystem (e.g., a system that generates X output signals in response to multiple inputs, in which the subsystem generates M of the inputs and the other X - M inputs are received from an external source) may also be referred to as a decoder system.

Throughout this disclosure, including in the claims, the term "processor" is used in a broad sense to denote a system or device that is programmable or otherwise configurable (e.g., with software or firmware) to perform operations on data (e.g., audio, video, or other image data). Examples of processors include a field-programmable gate array (or other configurable integrated circuit or chipset), a digital signal processor programmed and/or otherwise configured to perform pipelined processing on audio or other sound data, a programmable general-purpose processor or computer, and a programmable microprocessor chip or chipset.

Throughout this disclosure, including in the claims, the terms "couple" and "coupled" are used to mean either a direct or an indirect connection. Thus, if a first device is coupled to a second device, that connection may be a direct connection or an indirect connection via other devices and connections.

As used herein, a "smart device" is an electronic device that can operate to some extent interactively and/or autonomously, and that is generally configured to communicate with one or more other devices (or networks) via various wireless protocols such as Bluetooth, Zigbee, near-field communication, Wi-Fi, light fidelity (Li-Fi), 3G, 4G, 5G, etc. Several notable types of smart devices are smartphones, smart cars, smart thermostats, smart doorbells, smart locks, smart refrigerators, phablets and tablets, smartwatches, smart bands, smart key chains, and smart audio devices. The term "smart device" may also refer to a device that exhibits some properties of ubiquitous computing, such as artificial intelligence.

In this document, the expression "smart audio device" denotes a smart device that is either a single-purpose audio device or a multi-purpose audio device (e.g., an audio device that implements at least some aspects of virtual assistant functionality). A single-purpose audio device is a device (e.g., a television (TV)) that includes or is coupled to at least one microphone (and optionally also includes or is coupled to at least one speaker and/or at least one camera) and that is designed largely or primarily to achieve a single purpose. For example, although a TV typically can play (and is thought of as being capable of playing) audio from program material, in most instances a modern TV runs some operating system on which applications, including the application for watching television, run locally. In this sense, a single-purpose audio device having speaker(s) and microphone(s) is often configured to run a local application and/or service to use the speaker(s) and microphone(s) directly. Some single-purpose audio devices may be configured to group together to achieve playing of audio over a zone or user-configured area.

One common type of multi-purpose audio device is an audio device that implements at least some aspects of virtual assistant functionality, although other aspects of virtual assistant functionality may be implemented by one or more other devices, such as one or more servers with which the multi-purpose audio device is configured to communicate. Such a multi-purpose audio device may be referred to herein as a "virtual assistant." A virtual assistant is a device (e.g., a smart speaker or a voice-assistant-integrated device) that includes or is coupled to at least one microphone (and optionally also includes or is coupled to at least one speaker and/or at least one camera). In some examples, a virtual assistant may provide the ability to use multiple devices (distinct from the virtual assistant) for applications that are in a sense cloud-enabled or otherwise not fully implemented in or on the virtual assistant itself. In other words, at least some aspects of virtual assistant functionality (e.g., speech recognition functionality) may be implemented (at least in part) by one or more servers or other devices with which the virtual assistant may communicate via a network, such as the Internet. Virtual assistants may sometimes work together, e.g., in a discrete and conditionally defined way. For example, two or more virtual assistants may work together in the sense that one of them (e.g., the one that is most confident that it has heard a wake-up word) responds to the wake-up word. In some implementations, the connected virtual assistants may form a sort of constellation, which may be managed by one main application, which may be (or implement) a virtual assistant.

In this document, "wake-up word" is used in a broad sense to denote any sound (e.g., a word uttered by a human, or some other sound) where a smart audio device is configured to wake up in response to detecting ("hearing") the sound (using at least one microphone included in or coupled to the smart audio device, or at least one other microphone). In this context, to "wake up" means that the device enters a state in which it awaits (in other words, is listening for) a sound command. In some instances, what is referred to herein as a "wake-up word" may include more than one word, e.g., a phrase.

In this document, the expression "wake-up word detector" denotes a device configured to search continuously for alignment between real-time sound (e.g., speech) features and a trained model (or software that includes instructions for configuring a device to do so). Typically, a wake-up word event is triggered whenever the wake-up word detector determines that the probability that a wake-up word has been detected exceeds a predefined threshold. For example, the threshold may be a predetermined threshold that is tuned to give a reasonable compromise between the false acceptance rate and the false rejection rate. Following a wake-up word event, a device may enter a state (which may be referred to as an "awakened" state or a state of "attentiveness") in which it listens for a command and passes a received command on to a larger, more computationally intensive recognizer.
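The thresholding behavior described above can be sketched as follows. This is a minimal illustration, not the detector of any particular device: the function name `detect_wakeword_events`, the per-frame probability list, the threshold value of 0.8, and the re-arming rule (no second event until the probability falls back to or below the threshold) are all assumptions made for the example.

```python
from typing import Iterable, List


def detect_wakeword_events(probabilities: Iterable[float],
                           threshold: float = 0.8) -> List[int]:
    """Return the frame indices at which a wake-up word event is triggered.

    An event fires when the probability rises above the threshold. As an
    assumed convention, the detector does not fire again until the
    probability has dropped back to or below the threshold.
    """
    events = []
    armed = True  # True while the detector is ready to fire
    for index, p in enumerate(probabilities):
        if armed and p > threshold:
            events.append(index)
            armed = False
        elif p <= threshold:
            armed = True
    return events


frame_probs = [0.1, 0.4, 0.85, 0.9, 0.3, 0.95]
events = detect_wakeword_events(frame_probs)
print(events)
```

Raising `threshold` in this sketch lowers the false acceptance rate at the expense of the false rejection rate, which is the compromise the text refers to.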

As used herein, the terms "program stream" and "content stream" refer to a collection of one or more audio signals, and in some instances video signals, at least portions of which are meant to be heard together. Examples include a selection of music, a movie soundtrack, a movie, a television program, the audio portion of a television program, a podcast, a live voice call, a synthesized voice response from a smart assistant, etc. In some instances, a content stream may include multiple versions of at least a portion of the audio signals, e.g., the same dialogue in more than one language. In such instances, only one version of the audio data or portion thereof (e.g., a version corresponding to a single language) is intended to be reproduced at a time.

Summary of the invention

At least some aspects of the present disclosure may be implemented via one or more audio processing methods. The audio processing methods manage echo in an audio system. The audio system includes a plurality of audio devices in an audio environment. Each of the plurality of audio devices includes one or more loudspeakers. In some instances, the method(s) may be implemented, at least in part, by a control system and/or via instructions (e.g., software) stored on one or more non-transitory media of a first device of the plurality of audio devices of the audio system. The first device may include one or more microphones. Some such methods involve obtaining, by the control system of the first device, a plurality of echo references. The plurality of echo references may include at least one echo reference for each audio device of the plurality of audio devices in the audio environment. Each echo reference may correspond to audio data played back by one or more loudspeakers of the corresponding audio device of the plurality of audio devices. The plurality of echo references includes at least one echo reference of the first audio device.

The method may involve making, by the control system, an importance estimate for each of the plurality of echo references. In some examples, making the importance estimate may involve determining an expected contribution of each echo reference to echo mitigation performed by at least one echo management system of at least one audio device of the audio environment. The echo management system(s) may, for example, include an acoustic echo canceller (AEC) and/or an acoustic echo suppressor (AES).

The method may involve selecting, by the control system and based at least in part on the importance estimates, one or more echo references from the plurality of echo references. The selected echo references may be a subset of one or more echo references of the (entire) plurality of echo references. The method may involve providing, by the control system, the one or more selected echo references to the at least one echo management system. In some examples, the method may involve causing the at least one echo management system to cancel or suppress echo based at least in part on the one or more selected echo references.
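The selection step can be illustrated with a short sketch. It assumes the importance estimates are already available as one number per echo reference and that a fixed maximum number of references may be passed to the echo management system; the function name `select_echo_references`, the reference names, and the numeric values are hypothetical and do not come from the disclosure.

```python
from typing import Dict, List


def select_echo_references(importance: Dict[str, float],
                           max_references: int) -> List[str]:
    """Pick the echo references with the highest importance estimates."""
    ranked = sorted(importance, key=lambda name: importance[name], reverse=True)
    return ranked[:max_references]


# Hypothetical importance estimates for one local and three non-local references.
importance_estimates = {
    "local": 0.9,
    "device_B_left": 0.6,
    "device_B_right": 0.2,
    "device_C": 0.4,
}
selected = select_echo_references(importance_estimates, max_references=2)
print(selected)
```

The selected names would then identify the subset of references handed to the AEC and/or AES; a top-N ranking is only one of many possible selection rules.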

According to some examples, the audio devices of the audio system may be communicatively coupled via a wired or wireless communication network. The plurality of echo references (e.g., non-local echo references of audio devices other than the first audio device and/or an echo reference of the first audio device) may be obtained via the wired or wireless communication network.

According to some examples, obtaining the plurality of echo references may involve receiving a content stream that includes audio data and determining one or more echo references of the plurality of echo references based on the audio data.

In some embodiments, the control system may be, or may include, an audio device control system of an audio device in the audio environment. In some such embodiments, the method may involve rendering, by the audio device control system, the audio data for reproduction on the audio device, thereby producing local speaker feed signals. In some such embodiments, the method may involve determining a local echo reference that corresponds to the local speaker feed signals.

In some examples, obtaining the plurality of echo references may involve determining one or more non-local echo references based on the audio data. In some such examples, each non-local echo reference may correspond to non-local speaker feed signals for playback on another audio device of the audio environment.

According to some examples, obtaining the plurality of echo references may involve receiving one or more non-local echo references. In some such examples, each non-local echo reference may correspond to non-local speaker feed signals for playback on another audio device of the audio environment. In some examples, receiving the one or more non-local echo references may involve receiving them from one or more other audio devices of the audio environment. In some examples, receiving the one or more non-local echo references may involve receiving each of them from a single other device of the audio environment.

In some examples, the method may involve a cost determination. According to some examples, the cost determination may involve determining a cost for at least one echo reference of the plurality of echo references. In some examples, selecting the one or more selected echo references may be based at least in part on the cost determination.

According to some examples, the cost determination may be based on the network bandwidth required to transmit the at least one echo reference, an encoding computational requirement for encoding the at least one echo reference, a decoding computational requirement for decoding the at least one echo reference, an echo management system computational requirement for use of the at least one echo reference by the echo management system, or one or more combinations thereof.

In some examples, the cost determination may be based on a replica of the at least one echo reference in the time domain or the frequency domain, a downsampled version of the at least one echo reference, a lossy compression of the at least one echo reference, banded power information for the at least one echo reference, or one or more combinations thereof. According to some examples, the cost determination may be based on a method of compressing relatively more important echo references less than relatively less important echo references.
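One way to picture a cost-aware selection is the sketch below. It is only an illustration under stated assumptions: the cost of a reference is modeled as an unweighted sum of the four contributions listed above, references are added greedily in order of importance per unit cost until a budget is exhausted, and every cost is assumed to be positive. The class `EchoReferenceCandidate`, the function `select_within_budget`, the budget, and all numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EchoReferenceCandidate:
    name: str
    importance: float    # importance estimate, higher is more useful
    bandwidth: float     # network bandwidth needed to transmit the reference
    encode: float        # encoding computational requirement
    decode: float        # decoding computational requirement
    ems_compute: float   # echo management system computational requirement

    @property
    def cost(self) -> float:
        # Assumed cost model: an unweighted sum of the four contributions.
        return self.bandwidth + self.encode + self.decode + self.ems_compute


def select_within_budget(candidates: List[EchoReferenceCandidate],
                         budget: float) -> List[str]:
    """Greedily add references in order of importance per unit cost."""
    ranked = sorted(candidates, key=lambda c: c.importance / c.cost, reverse=True)
    selected, spent = [], 0.0
    for c in ranked:
        if spent + c.cost <= budget:
            selected.append(c.name)
            spent += c.cost
    return selected


# Assumption: the local reference needs no transmission, encoding, or decoding.
candidates = [
    EchoReferenceCandidate("local", 0.9, 0.0, 0.0, 0.0, 1.0),
    EchoReferenceCandidate("device_B", 0.6, 1.0, 0.5, 0.5, 1.0),
    EchoReferenceCandidate("device_C", 0.2, 1.0, 0.5, 0.5, 1.0),
]
chosen = select_within_budget(candidates, budget=4.0)
print(chosen)
```

A cheaper representation of a reference (for example a downsampled or lossily compressed version) would appear in this sketch as a candidate with lower `bandwidth` and compute figures, so less important references could still fit within the budget in reduced form.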

In some examples, the method may involve determining a current echo management system performance level. According to some examples, selecting the one or more selected echo references may be based at least in part on the current echo management system performance level.

According to some examples, making the importance estimate may involve determining an importance metric for the corresponding echo reference. In some such examples, determining the importance metric may involve determining a level of the corresponding echo reference, determining a uniqueness of the corresponding echo reference, determining a temporal persistence of the corresponding echo reference, determining an audibility of the corresponding echo reference, or one or more combinations thereof.

In some instances, determining the importance metric may be based at least in part on data or metadata corresponding to an audio device layout, loudspeaker metadata, metadata corresponding to received audio data, an upmixing matrix, a loudspeaker activation matrix, or one or more combinations thereof.

According to some examples, determining the importance metric may be based at least in part on a current listening objective, a current ambient noise estimate, an estimate of the current performance of the at least one echo management system, or one or more combinations thereof.
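Two of the factors named above, level and uniqueness, can be sketched numerically. This is a hedged illustration rather than the metric of the disclosure: level is taken to be the RMS of a block of samples, uniqueness is taken to be one minus the magnitude of the normalized correlation with another reference, and the two are combined with an assumed equal-weight sum. The function names, weights, and sample values are all hypothetical.

```python
import math
from typing import Sequence


def rms_level(signal: Sequence[float]) -> float:
    """Root-mean-square level of a block of samples."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))


def uniqueness(signal: Sequence[float], other: Sequence[float]) -> float:
    """One minus the magnitude of the normalized correlation with `other`."""
    dot = sum(a * b for a, b in zip(signal, other))
    norm = math.sqrt(sum(a * a for a in signal) * sum(b * b for b in other))
    if norm == 0.0:
        return 1.0  # assumed convention: silence is treated as uncorrelated
    return 1.0 - abs(dot) / norm


def importance_metric(level: float, unique: float,
                      w_level: float = 0.5, w_unique: float = 0.5) -> float:
    """Assumed weighted sum of the level and uniqueness factors."""
    return w_level * level + w_unique * unique


reference_a = [1.0, -1.0, 1.0, -1.0]
reference_b = [0.5, -0.5, 0.5, -0.5]   # scaled copy of reference_a
reference_c = [1.0, 1.0, -1.0, -1.0]   # orthogonal to reference_a

u_b = uniqueness(reference_b, reference_a)
u_c = uniqueness(reference_c, reference_a)
print(rms_level(reference_a), u_b, u_c)
```

In this toy case a scaled copy of an existing reference scores near zero on uniqueness and so contributes little beyond the reference it duplicates, while an orthogonal reference at the same level scores highest. Temporal persistence and audibility could be added as further weighted terms in the same way.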

Some or all of the operations, functions, and/or methods described herein may be performed by one or more devices according to instructions (e.g., software) stored on one or more non-transitory media. Such non-transitory media may include memory devices such as those described herein, including but not limited to random access memory (RAM) devices, read-only memory (ROM) devices, etc. Accordingly, some innovative aspects of the subject matter described in this disclosure may be implemented via one or more non-transitory media having software stored thereon.

At least some aspects of the present disclosure may be implemented via an apparatus. For example, one or more devices may be capable of performing, at least in part, the methods disclosed herein. In some embodiments, the apparatus is, or includes, an audio processing system having an interface system and a control system. The control system may include one or more general-purpose single-chip or multi-chip processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic, discrete hardware components, or combinations thereof.

Details of one or more embodiments of the subject matter described in this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages will become apparent from the description, the drawings, and the claims. Note that the relative dimensions of the following figures may not be drawn to scale.

BRIEF DESCRIPTION OF THE DRAWINGS

Like reference numbers and designations in the various drawings indicate like elements.

FIG. 1A is a block diagram showing examples of components of an apparatus capable of implementing various aspects of the present disclosure.

FIG. 1B shows an example of an audio environment.

FIGS. 1C and 1D show examples of how audio devices 110A-110C may receive playback channels.

FIG. 1E shows another example of an audio environment.

FIG. 2A presents a block diagram of an audio device capable of performing at least some of the disclosed embodiments.

FIGS. 2B and 2C show additional examples of audio devices in an audio environment.

FIG. 3A presents a block diagram showing components of an audio device according to one example.

FIGS. 3B and 3C are graphs showing examples of expected echo management performance versus the number of echo references used for echo management.

FIG. 4 presents a block diagram showing components of an echo reference orchestrator according to one example.

FIG. 5A is a flow chart outlining one example of the disclosed methods.

FIG. 5B is a flow chart outlining another example of the disclosed methods.

FIG. 6 is a flow chart outlining one example of the disclosed methods.

FIG. 7 shows an example of a floor plan of an audio environment, which in this example is a living space.

DETAILED DESCRIPTION

FIG. 1A is a block diagram showing examples of components of an apparatus capable of implementing various aspects of the present disclosure. As with the other figures provided herein, the types and numbers of elements shown in FIG. 1A are provided merely as examples. Other embodiments may include more, fewer, and/or different types and numbers of elements. According to some examples, the apparatus 50 may be configured to perform at least some of the methods disclosed herein. In some embodiments, the apparatus 50 may be, or may include, one or more components of an audio system. For example, in some embodiments the apparatus 50 may be an audio device, such as a smart audio device. In other examples, the apparatus 50 may be a mobile device (such as a cellular telephone), a laptop computer, a tablet device, a television, or another type of device.

According to some alternative embodiments, the apparatus 50 may be, or may include, a server. In some such examples, the apparatus 50 may be, or may include, an encoder. Accordingly, in some instances the apparatus 50 may be a device configured for use within an audio environment, such as a home audio environment, whereas in other instances the apparatus 50 may be a device configured for use in "the cloud," e.g., a server.

In this example, the apparatus 50 includes an interface system 55 and a control system 60. In some embodiments, the interface system 55 may be configured to communicate with one or more other devices of an audio environment. In some examples, the audio environment may be a home audio environment. In other examples, the audio environment may be another type of environment, such as an office environment, an automobile environment, a train environment, a street or sidewalk environment, a park environment, etc. In some embodiments, the interface system 55 may be configured to exchange control information and associated data with audio devices of the audio environment. In some examples, the control information and associated data may pertain to one or more software applications that the apparatus 50 is executing.

In some embodiments, the interface system 55 may be configured to receive, or to provide, a content stream. The content stream may include audio data. The audio data may include, but is not limited to, audio signals. In some instances, the audio data may include spatial data, such as channel data and/or spatial metadata. The metadata may, for example, be provided by a device that may be referred to herein as an "encoder." In some examples, the content stream may include video data and audio data corresponding to the video data.

The interface system 55 may include one or more network interfaces and/or one or more external device interfaces (such as one or more universal serial bus (USB) interfaces). According to some embodiments, the interface system 55 may include one or more wireless interfaces. The interface system 55 may include one or more devices for implementing a user interface, such as one or more microphones, one or more speakers, a display system, a touch sensor system, and/or a gesture sensor system. In some examples, the interface system 55 may include one or more interfaces between the control system 60 and a memory system, such as the optional memory system 65 shown in FIG. 1A. However, in some instances the control system 60 may include a memory system. In some embodiments, the interface system 55 may be configured to receive input from one or more microphones in the environment.

The control system 60 may, for example, include a general-purpose single-chip or multi-chip processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, and/or discrete hardware components.

In some embodiments, the control system 60 may reside in more than one device. For example, in some embodiments a portion of the control system 60 may reside in a device within one of the environments depicted herein, and another portion of the control system 60 may reside in a device outside the environment, such as a server, a mobile device (e.g., a smartphone or a tablet computer), etc. In other examples, a portion of the control system 60 may reside in a device within one of the environments depicted herein, and another portion of the control system 60 may reside in one or more other devices of the environment. For example, control system functionality may be distributed across multiple smart audio devices of the environment, or may be shared by an orchestrating device (such as what may be referred to herein as a smart home hub) and one or more other devices of the environment. In other examples, a portion of the control system 60 may reside in a device that implements a cloud-based service, such as a server, and another portion of the control system 60 may reside in another device that implements the cloud-based service, such as another server, a memory device, etc. In some examples, the interface system 55 may also reside in more than one device.

在一些实施方式中，控制系统60可以被配置用于至少部分地执行本文公开的方法。根据一些示例，控制系统60可以被配置为获得多个回声参考。多个回声参考可以包括针对音频环境中的多个音频设备中的每个音频设备的至少一个回声参考。每个回声参考可以例如对应于由多个音频设备中的一个音频设备的一个或多个扩音器回放的音频数据。In some embodiments, the control system 60 may be configured to at least partially perform the method disclosed herein. According to some examples, the control system 60 may be configured to obtain multiple echo references. The multiple echo references may include at least one echo reference for each audio device in the multiple audio devices in the audio environment. Each echo reference may, for example, correspond to audio data played back by one or more loudspeakers of an audio device in the multiple audio devices.

在一些实施方式中，控制系统60可以被配置为对多个回声参考中的每个回声参考做出重要性估计。在一些示例中，做出重要性估计可以涉及确定每个回声参考对由音频环境的至少一个音频设备的至少一个回声管理系统进行的回声减轻的预期贡献。(多个)回声管理系统可以包括声学回声消除器(AEC)和/或声学回声抑制器(AES)。In some embodiments, the control system 60 can be configured to make an importance estimate for each of the plurality of echo references. In some examples, making the importance estimate can involve determining an expected contribution of each echo reference to echo mitigation by at least one echo management system of at least one audio device of the audio environment. The echo management system(s) can include an acoustic echo canceller (AEC) and/or an acoustic echo suppressor (AES).

According to some examples, the control system 60 may be configured to select one or more echo references based at least in part on the importance estimates. In some examples, the control system 60 may be configured to provide the one or more selected echo references to at least one echo management system.

Some or all of the methods described herein may be performed by one or more devices according to instructions (e.g., software) stored on one or more non-transitory media. Such non-transitory media may include memory devices such as those described herein, including but not limited to random access memory (RAM) devices, read-only memory (ROM) devices, etc. The one or more non-transitory media may, for example, reside in the optional memory system 65 and/or the control system 60 shown in FIG. 1A. Accordingly, various innovative aspects of the subject matter described in this disclosure may be implemented in one or more non-transitory media having software stored thereon. The software may, for example, include instructions for controlling at least one device to perform some or all of the methods disclosed herein. The software may, for example, be executed by one or more components of a control system such as the control system 60 of FIG. 1A.

In some examples, the apparatus 50 may include the optional microphone system 70 shown in FIG. 1A. The optional microphone system 70 may include one or more microphones. According to some examples, the optional microphone system 70 may include a microphone array. In some examples, the microphone array may be configured to determine direction of arrival (DOA) and/or time of arrival (TOA) information, for example, according to instructions from the control system 60. In some instances, the microphone array may be configured for receive-side beamforming, for example, according to instructions from the control system 60. In some implementations, one or more of the microphones may be part of, or associated with, another device (such as a speaker of a speaker system, a smart audio device, etc.). In some examples, the apparatus 50 may not include the microphone system 70. However, in some such implementations, the apparatus 50 may still be configured to receive, via the interface system 55, microphone data for one or more microphones in the audio environment. In some such implementations, a cloud-based implementation of the apparatus 50 may be configured to receive microphone data, or data corresponding to the microphone data, from one or more microphones in the audio environment via the interface system 55.

根据一些实施方式，装置50可以包括图1A中示出的可选扩音器系统75。可选扩音器系统75可以包括一个或多个扩音器，所述扩音器在本文中也可以被称为“扬声器”，或更通常地被称为“音频再现换能器”。在一些示例(例如，基于云的实施方式)中，装置50可以不包括扩音器系统75。According to some embodiments, the device 50 may include the optional loudspeaker system 75 shown in FIG. 1A. The optional loudspeaker system 75 may include one or more loudspeakers, which may also be referred to herein as "speakers," or more generally as "audio reproduction transducers." In some examples (e.g., cloud-based embodiments), the device 50 may not include the loudspeaker system 75.

In some implementations, the apparatus 50 may include the optional sensor system 80 shown in FIG. 1A. The optional sensor system 80 may include one or more touch sensors, gesture sensors, motion detectors, etc. According to some implementations, the optional sensor system 80 may include one or more cameras. In some implementations, the cameras may be stand-alone cameras. In some examples, one or more cameras of the optional sensor system 80 may reside in a smart audio device, which may be a single-purpose audio device or a virtual assistant. In some such examples, one or more cameras of the optional sensor system 80 may reside in a television, a mobile phone, or a smart speaker. In some examples, the apparatus 50 may not include the sensor system 80. However, in some such implementations, the apparatus 50 may still be configured to receive, via the interface system 55, sensor data for one or more sensors in the audio environment.

在一些实施方式中，装置50可以包括图1A中示出的可选显示系统85。可选显示系统85可以包括一个或多个显示器，如一个或多个发光二极管(LED)显示器。在一些实例中，可选显示系统85可以包括一个或多个有机发光二极管(OLED)显示器。在一些示例中，可选显示系统85可以包括智能音频设备的一个或多个显示器。在其他示例中，可选显示系统85可以包括电视显示器、膝上型计算机显示器、移动设备显示器、或另一种类型的显示器。在装置50包括显示系统85的一些示例中，传感器系统80可以包括接近显示系统85的一个或多个显示器的触摸传感器系统和/或手势传感器系统。根据一些这样的实施方式，控制系统60可以被配置用于控制显示系统85来呈现一个或多个图形用户界面(GUI)。In some embodiments, the device 50 may include an optional display system 85 shown in Figure 1A. The optional display system 85 may include one or more displays, such as one or more light emitting diode (LED) displays. In some instances, the optional display system 85 may include one or more organic light emitting diode (OLED) displays. In some examples, the optional display system 85 may include one or more displays of a smart audio device. In other examples, the optional display system 85 may include a television display, a laptop computer display, a mobile device display, or another type of display. In some examples where the device 50 includes a display system 85, the sensor system 80 may include a touch sensor system and/or a gesture sensor system close to one or more displays of the display system 85. According to some such embodiments, the control system 60 may be configured to control the display system 85 to present one or more graphical user interfaces (GUIs).

根据一些这样的示例，装置50可以是或者可以包括智能音频设备。在一些这样的实施方式中，装置50可以是或者可以包括唤醒词检测器。例如，装置50可以是或者可以包括虚拟助理。According to some such examples, the apparatus 50 may be or may include a smart audio device. In some such implementations, the apparatus 50 may be or may include a wake-up word detector. For example, the apparatus 50 may be or may include a virtual assistant.

对于立体声或单声道的回放媒体，传统上它是经由物理线缆连接到音频播放器(例如，CD/DVD播放器、电视(TV)等)的一对扬声器渲染到音频环境(例如，生活空间、汽车、办公空间等)中的。随着智能扬声器的流行，用户通常在其家中(或其他音频环境)中拥有多于两个能够回放音频的被配置为无线通信的音频设备(其可以包括但不限于智能扬声器或其他智能音频设备)。For stereo or mono playback media, it is traditionally rendered into an audio environment (e.g., living space, car, office space, etc.) via a pair of speakers connected to an audio player (e.g., CD/DVD player, television (TV), etc.) via physical cables. With the popularity of smart speakers, users typically have more than two audio devices (which may include but are not limited to smart speakers or other smart audio devices) configured for wireless communication that are capable of playing back audio in their homes (or other audio environments).

智能扬声器通常被配置为根据语音命令进行操作。因此，这种智能扬声器通常被配置为连续收听唤醒词，唤醒词之后将通常跟着语音命令。任何连续收听任务(比如等待唤醒词或执行任何类型的“连续校准”)将优选地在内容回放(比如音乐回放、电影和电视节目的音轨回放等)以及发生设备交互时(例如，在电话通话期间)继续运行。需要在内容回放期间收听的音频设备通常需要采用某种形式的回声管理，例如回声消除和/或回声抑制，以从麦克风信号中去除“回声”(由设备播放的内容)。Smart speakers are typically configured to operate based on voice commands. Therefore, such smart speakers are typically configured to continuously listen for a wake word, which will typically be followed by a voice command. Any continuous listening tasks (such as waiting for the wake word or performing any type of "continuous calibration") will preferably continue to run during content playback (such as music playback, movie and TV show soundtrack playback, etc.) and when device interaction occurs (for example, during a phone call). Audio devices that need to listen during content playback typically need to employ some form of echo management, such as echo cancellation and/or echo suppression, to remove the "echo" (the content played by the device) from the microphone signal.

图1B示出了音频环境的示例。与本文提供的其他图一样，图1B中示出的要素的类型、数量和布置仅作为示例提供。其他实施方式可以包括更多、更少和/或不同类型、数量和/或布置的要素。Fig. 1B shows an example of an audio environment. As with other figures provided herein, the types, quantities, and arrangements of the elements shown in Fig. 1B are provided only as examples. Other embodiments may include more, fewer, and/or different types, quantities, and/or arrangements of elements.

根据该示例，音频环境100包括音频设备110A、110B和110C。在该示例中，音频设备110A-110C中的每一个是图1A的装置50的实例，并且包括麦克风系统70和扩音器系统75的实例，但这些在图1B中未示出。根据一些示例，每个音频设备110A-110C可以是智能音频设备，如智能扬声器。According to this example, the audio environment 100 includes audio devices 110A, 110B, and 110C. In this example, each of the audio devices 110A-110C is an instance of the apparatus 50 of FIG. 1A and includes an instance of the microphone system 70 and the loudspeaker system 75, but these are not shown in FIG. 1B. According to some examples, each audio device 110A-110C can be a smart audio device, such as a smart speaker.

在该示例中，音频设备110A-110C在人130正在说话的同时回放音频内容。音频设备110B的麦克风不仅检测由其自身的扬声器回放的音频内容，而且还检测人130的语音声音131以及由音频设备110A和110C回放的音频内容。In this example, audio devices 110A-110C play back audio content while person 130 is speaking. The microphone of audio device 110B detects not only the audio content played back by its own speaker, but also the voice sound 131 of person 130 and the audio content played back by audio devices 110A and 110C.

为了同时利用尽可能多的扬声器，典型的方法是让音频环境中的所有音频设备回放相同的内容，并使用某种定时机制来使回放媒体保持同步。这样做的优点是使分发变得简单，因为所有设备都会收到相同的回放媒体副本，无论是下载或流式传输到每个音频设备，还是由一个设备广播并多播到所有音频设备。In order to utilize as many speakers as possible simultaneously, the typical approach is to have all audio devices in the audio environment play back the same content and use some kind of timing mechanism to keep the playback media synchronized. This has the advantage of making distribution simple, because all devices receive the same copy of the playback media, whether it is downloaded or streamed to each audio device, or broadcast by one device and multicast to all audio devices.

A major drawback of this approach is that it provides no spatial effect. Spatial effects can be achieved by adding more playback channels (e.g., one per loudspeaker), for example, by upmixing. In some examples, spatial effects can be achieved via a flexible rendering process such as center of mass amplitude panning (CMAP), flexible virtualization (FV), or a combination of CMAP and FV. Relevant examples of CMAP, FV, and combinations thereof are described in International Patent Publication No. WO 2021/021707 A1 (e.g., on pages 25-41), which is hereby incorporated by reference.

FIGS. 1C and 1D illustrate additional examples of audio devices in an audio environment. According to these examples, the audio environment 100 includes a smart home hub 105 and audio devices 110A, 110B, and 110C. In these examples, the smart home hub 105 and the audio devices 110A-110C are instances of the apparatus 50 of FIG. 1A. According to these examples, each of the audio devices 110A-110C includes a corresponding one of the loudspeakers 121A, 121B, and 121C. According to some examples, each of the audio devices 110A-110C may be a smart audio device, such as a smart speaker.

FIGS. 1C and 1D show examples of how the audio devices 110A-110C may receive playback channels. In FIG. 1C, the encoded audio bitstream is multicast to all of the audio devices 110A-110C. In FIG. 1D, each of the audio devices 110A-110C receives only the channels that it needs for playback. The choice of bitstream distribution may vary from one implementation to another and may, for example, be based on the available system bandwidth, the coding efficiency of the audio codec in use, the capabilities of the audio devices 110A-110C, and/or other factors. The exact topology of the audio environments shown in FIGS. 1C and 1D is not important. However, these examples illustrate the fact that distributing audio channels to audio devices incurs some cost. The cost can be assessed in terms of the network bandwidth required, the added computational cost of encoding and decoding the audio channels, etc.

FIG. 1E shows another example of an audio environment. According to this example, the audio environment 100 includes audio devices 110A, 110B, 110C, and 110D. In this example, each of the audio devices 110A-110D is an instance of the apparatus 50 of FIG. 1A and includes at least one microphone (see microphones 120A, 120B, 120C, and 120D) and at least one loudspeaker (see loudspeakers 121A, 121B, 121C, and 121D). According to some examples, each of the audio devices 110A-110D may be a smart audio device, such as a smart speaker.

在该示例中，音频设备110A-110D经由扩音器121A-121D渲染内容122A、122B、122C和122D。麦克风120A-120D中的每一个检测到与由音频设备110A-110D中的每一个回放的内容122A-122D相对应的“回声”。在该示例中，音频设备110A-110D被配置为收听来自音频环境100内的人130的语音131中的命令或唤醒词。In this example, the audio devices 110A-110D render content 122A, 122B, 122C, and 122D via loudspeakers 121A-121D. Each of the microphones 120A-120D detects an "echo" corresponding to the content 122A-122D played back by each of the audio devices 110A-110D. In this example, the audio devices 110A-110D are configured to listen for a command or wake-up word in the voice 131 from a person 130 within the audio environment 100.

FIG. 2A presents a block diagram of an audio device capable of implementing at least some of the disclosed implementations. As with other figures provided herein, the types, numbers, and arrangements of elements shown in FIG. 2A are provided only as examples. Other implementations may include more, fewer, and/or different types, numbers, and/or arrangements of elements. In this example, audio device 110A is an instance of the audio device 110A of FIG. 1E. Here, audio device 110A includes control system 60a, which is an instance of the control system 60 of FIG. 1A. According to this implementation, control system 60a is able to listen to the voice 131 of person 130 in the presence of echoes corresponding to the content 122A, 122B, 122C, and 122D played back by each of the audio devices in the audio environment 100.

According to this example, the control system 60a implements a renderer 201A, a multi-channel acoustic echo management system (MC-EMS) 203A, and a voice processing block 240A. The MC-EMS 203A may include an acoustic echo canceller (AEC), an acoustic echo suppressor (AES), or both an AEC and an AES, depending on the particular implementation. According to this example, the voice processing block 240A is configured to detect a user's wake words and commands. In some implementations, the voice processing block 240A may be configured to support a communication session, such as a telephone call.

在该实施方式中，渲染器201A被配置为向MC-EMS203A提供本地回声参考220A。本地回声参考220A对应于(并且在该示例中等同于)提供给扩音器121A以供音频设备110A回放的扬声器馈送信号。根据该示例，渲染器201A还被配置为向MC-EMS203A提供非本地回声参考221A(对应于由音频环境100中的其他音频设备回放的内容122B、122C和122D)。In this embodiment, the renderer 201A is configured to provide a local echo reference 220A to the MC-EMS 203A. The local echo reference 220A corresponds to (and is equivalent to in this example) a speaker feed signal provided to the loudspeaker 121A for playback by the audio device 110A. According to this example, the renderer 201A is also configured to provide a non-local echo reference 221A (corresponding to content 122B, 122C, and 122D played back by other audio devices in the audio environment 100) to the MC-EMS 203A.

According to some examples, the audio device 110A receives a combined bitstream that includes audio data for all of the audio devices 110A-110D of FIG. 1E (e.g., as shown in FIG. 1C). In some such examples, the renderer 201A may be configured to separate the local echo reference 220A from the non-local echo reference 221A, to provide the local echo reference 220A to the loudspeaker 121A, and to provide both the local echo reference 220A and the non-local echo reference 221A to the MC-EMS 203A. In some alternative examples, the audio device 110A may receive a bitstream intended only for playback on the audio device 110A, e.g., as shown in FIG. 1D. In some such examples, the smart home hub 105 (or one or more of the other audio devices 110B-110D) may provide the non-local echo reference 221A to the audio device 110A, as suggested by the dashed arrow next to reference numeral 221A in FIG. 2A.
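The separation step described above can be sketched in a few lines. This is only an illustrative sketch, not the renderer of the disclosure: it assumes the combined bitstream has already been decoded into per-channel sample lists, and the function name `split_echo_references` and the dictionary layout are hypothetical.

```python
def split_echo_references(channels, local_channel):
    """Split decoded playback channels into local and non-local echo references.

    channels: dict mapping a channel index to its list of samples
              (hypothetical layout, assumed for illustration).
    local_channel: index of the channel this device plays back itself.
    """
    # The local reference is what this device feeds to its own loudspeaker.
    local_ref = channels[local_channel]
    # Every other channel is played by some other device in the environment.
    non_local_refs = {
        index: samples
        for index, samples in channels.items()
        if index != local_channel
    }
    return local_ref, non_local_refs
```

In this sketch the local reference would go both to the loudspeaker and to the echo management system, while the non-local references would go only to the echo management system.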

在一些实例中，本地回声参考220A和/或非本地回声参考221A可以是提供给扩音器121A-121D以供回放的扬声器馈送信号的全保真度复制品。在一些替代示例中，本地回声参考220A和/或非本地回声参考221A可以是提供给扩音器121A-121D以供回放的扬声器馈送信号的较低保真度表示。在一些这样的示例中，非本地回声参考221A可以是提供给扩音器121B-121D以供回放的扬声器馈送信号的下采样版本。根据一些示例，非本地回声参考221A可以是提供给扩音器121B-121D以供回放的扬声器馈送信号的有损压缩。在一些示例中，非本地回声参考221A可以是与提供给扩音器121B-121D以供回放的扬声器馈送信号相对应的分段功率信息(banded power information)。In some instances, the local echo reference 220A and/or the non-local echo reference 221A may be a full-fidelity replica of the speaker feed signals provided to the loudspeakers 121A-121D for playback. In some alternative examples, the local echo reference 220A and/or the non-local echo reference 221A may be a lower fidelity representation of the speaker feed signals provided to the loudspeakers 121A-121D for playback. In some such examples, the non-local echo reference 221A may be a downsampled version of the speaker feed signals provided to the loudspeakers 121B-121D for playback. According to some examples, the non-local echo reference 221A may be a lossy compression of the speaker feed signals provided to the loudspeakers 121B-121D for playback. In some examples, the non-local echo reference 221A may be banded power information corresponding to the speaker feed signals provided to the loudspeakers 121B-121D for playback.

According to this implementation, the MC-EMS 203A is configured to use the local echo reference 220A and the non-local echo reference 221A to predict the echo and to cancel and/or suppress it from the microphone signal 223A, thereby producing a residual signal 224A in which the speech-to-echo ratio (SER) may have been improved relative to the microphone signal 223A. The residual signal 224A may enable the voice processing block 240A to detect user wake words and commands. In some implementations, the voice processing block 240A may be configured to support a communication session, such as a telephone call.
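As a rough illustration of predicting and cancelling echo from several references, the following sketch uses a normalized least-mean-squares (NLMS) adaptive filter with one short filter per reference. NLMS is a common textbook choice assumed here for illustration; the disclosure does not state which adaptation algorithm the MC-EMS uses, and the function name and parameters (`taps`, `mu`, `eps`) are hypothetical.

```python
def nlms_cancel(mic, refs, taps=4, mu=0.5, eps=1e-8):
    """Multi-reference NLMS echo canceller (illustrative sketch only).

    mic:  list of microphone samples.
    refs: list of echo reference signals, each as long as `mic`.
    Returns the residual signal (microphone minus predicted echo).
    """
    # One adaptive FIR filter per echo reference.
    weights = [[0.0] * taps for _ in refs]
    residual = []
    for n in range(len(mic)):
        # Most recent `taps` samples of each reference, newest first,
        # zero-padded before the start of the signal.
        hist = [[r[n - k] if n - k >= 0 else 0.0 for k in range(taps)]
                for r in refs]
        # Predicted echo is the sum of all filtered references.
        predicted = sum(w * x
                        for wr, hr in zip(weights, hist)
                        for w, x in zip(wr, hr))
        err = mic[n] - predicted
        # Normalize the step size by the total reference energy.
        energy = sum(x * x for hr in hist for x in hr) + eps
        for wr, hr in zip(weights, hist):
            for k in range(taps):
                wr[k] += mu * err * hr[k] / energy
        residual.append(err)
    return residual
```

Dropping a reference from `refs` removes its filter entirely, so any echo from that device stays in the residual; this is the effect that an importance estimate tries to anticipate.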

本公开的一些方面涉及对多个回声参考中的每个回声参考(例如，对本地回声参考220A和非本地回声参考221A)做出重要性估计。做出重要性估计可以涉及确定每个回声参考对由音频环境的至少一个音频设备的至少一个回声管理系统进行的回声减轻(例如，音频设备110A的MC-EMS203A进行的回声减轻)的预期贡献。下文中提供了各种示例。Some aspects of the present disclosure involve making an importance estimate for each echo reference in a plurality of echo references (e.g., for local echo reference 220A and non-local echo reference 221A). Making the importance estimate may involve determining an expected contribution of each echo reference to echo mitigation performed by at least one echo management system of at least one audio device of an audio environment (e.g., echo mitigation performed by MC-EMS 203A of audio device 110A). Various examples are provided below.

在分布式和编排式设备的背景下，出于回声管理的目的，根据一些示例，除了自身的回声参考之外，每个音频设备还可以获得与音频环境中一个或多个其他音频设备回放的内容相对应的回声参考。将特定回声参考包括在本地回声管理系统或“EMS”(比如音频设备110A的MC-EMS203A)中的影响可以根据多个参数而变化，诸如正在播出的音频内容的多样性、用于传输回声参考所需的网络带宽、在传输已编码回声参考的情况下用于编码回声参考的编码计算要求、用于解码回声参考的解码计算要求、用于由回声管理系统使用回声参考的回声管理系统计算要求、音频设备的相对可听度等。In the context of distributed and orchestrated devices, for purposes of echo management, according to some examples, each audio device may, in addition to its own echo reference, also obtain an echo reference corresponding to content played back by one or more other audio devices in the audio environment. The impact of including a particular echo reference in a local echo management system or "EMS" (e.g., MC-EMS 203A of audio device 110A) may vary depending on a number of parameters, such as the diversity of the audio content being played, the network bandwidth required for transmitting the echo reference, the encoding computational requirements for encoding the echo reference in the case of transmitting an encoded echo reference, the decoding computational requirements for decoding the echo reference, the echo management system computational requirements for use of the echo reference by the echo management system, the relative audibility of the audio devices, etc.

For example, if every audio device is rendering the same content (in other words, if mono audio is being played back), there is little (although non-zero) benefit in providing additional references to the EMS. Furthermore, due to practical limitations (such as a bandwidth-limited network), it may not be desirable for all devices to share replicas of their local echo references. Therefore, some implementations may provide a distributed and orchestrated EMS (DOEMS), in which echo references are prioritized and transmitted (or not transmitted) accordingly. Some such examples may implement a trade-off between the cost of each additional echo reference (e.g., the network bandwidth required and/or the computational overhead required) and its benefit (e.g., the expected improvement in echo mitigation, which may be measured in terms of signal-to-echo ratio (SER) and/or echo return loss enhancement (ERLE)).
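One simple way to picture such a cost/benefit trade-off is a greedy selection under a budget. The sketch below is an assumption made for illustration, not the prioritization method of the disclosure: the `EchoReference` fields, the benefit-per-cost ranking, and the single scalar budget are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class EchoReference:
    name: str
    importance: float  # hypothetical expected mitigation gain, e.g. ERLE in dB
    cost: float        # hypothetical cost, e.g. network bandwidth in kbit/s


def select_references(candidates, budget):
    """Greedily pick the references with the best importance per unit cost."""
    def value(ref):
        # A zero-cost reference (e.g. the local one) is always ranked first.
        return ref.importance / ref.cost if ref.cost > 0 else float("inf")

    selected, spent = [], 0.0
    for ref in sorted(candidates, key=value, reverse=True):
        # Skip any reference that would exceed the remaining budget.
        if spent + ref.cost <= budget:
            selected.append(ref)
            spent += ref.cost
    return selected
```

With this ranking, a reference that carries nearly the same content as one already selected would be given a low importance and would tend to be dropped first when bandwidth is tight.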

FIGS. 2B and 2C show additional examples of audio devices in an audio environment. According to these examples, the audio environment 100 includes a smart home hub 105 and audio devices 110A, 110B, and 110C. In these examples, the smart home hub 105 and the audio devices 110A-110C are instances of the apparatus 50 of FIG. 1A. According to these examples, each of the audio devices 110A-110C includes a corresponding one of the microphones 120A, 120B, and 120C and a corresponding one of the loudspeakers 121A, 121B, and 121C. According to some examples, each of the audio devices 110A-110C may be a smart audio device, such as a smart speaker.

在图2B中，智能家居中枢105将相同的已编码音频比特流发送到所有音频设备110A-110C。在图2C中，智能家居中枢105仅发送每个音频设备110A-110C进行回放所需的音频声道。在这两个示例中，音频声道0旨在用于在音频设备110A上回放，音频声道1旨在用于在音频设备110B上回放并且音频声道2旨在用于在音频设备110C上回放。In Figure 2B, the smart home hub 105 sends the same encoded audio bitstream to all audio devices 110A-110C. In Figure 2C, the smart home hub 105 sends only the audio channels required by each audio device 110A-110C for playback. In both examples, audio channel 0 is intended for playback on audio device 110A, audio channel 1 is intended for playback on audio device 110B, and audio channel 2 is intended for playback on audio device 110C.

FIGS. 2B and 2C also show examples of sharing echo reference data over a local network. In these examples, audio device 110A sends echo reference 220A’, which corresponds to the loudspeaker playback of audio device 110A, to audio devices 110B and 110C over the local network. In these examples, echo reference 220A’ differs from the channel 0 audio found in the bitstream. In some instances, echo reference 220A’ may differ from the channel 0 audio because playback post-processing is applied on audio device 110A. In the example shown in FIG. 2C, the combined bitstream is not provided to all of the audio devices 110A-110C, so another device (such as audio device 110A or the smart home hub 105) provides echo reference 220A’. In the scenario depicted in FIG. 2B, even though the combined bitstream is provided to all of the audio devices 110A-110C, it may still be necessary in some instances to transmit echo reference 220A’.

在其他示例中，回声参考220A’可能不同于声道0音频，因为回声参考220A’可能不是在音频设备110A上回放的音频数据的全保真度复制品。在一些这样的示例中，回声参考220A’可以对应于在音频设备110A上回放的音频数据，但是可能需要比完整复制品相对较少的数据，并且因此当传输回声参考220A’时可以消耗相对较少的本地网络带宽。In other examples, the echo reference 220A' may differ from the channel 0 audio because the echo reference 220A' may not be a full-fidelity replica of the audio data played back on the audio device 110A. In some such examples, the echo reference 220A' may correspond to the audio data played back on the audio device 110A, but may require relatively less data than a full replica and, therefore, may consume relatively less local network bandwidth when transmitting the echo reference 220A'.

根据一些这样的示例，音频设备110A可以被配置为产生上文参考图2A描述的本地回声参考220A的下采样版本。在一些这样的示例中，回声参考220A’可以是或可以包括下采样版本。According to some such examples, audio device 110A may be configured to generate a downsampled version of local echo reference 220A described above with reference to FIG. 2A. In some such examples, echo reference 220A' may be or may include the downsampled version.

在一些示例中，音频设备110A可以被配置为对本地回声参考220A进行有损压缩。在这种实例中，回声参考220A’可以是控制系统60a对本地回声参考220A应用有损压缩算法的结果。In some examples, the audio device 110A may be configured to lossily compress the local echo reference 220A. In such an instance, the echo reference 220A' may be the result of the control system 60a applying a lossy compression algorithm to the local echo reference 220A.

According to some examples, audio device 110A may be configured to provide audio devices 110B and 110C with banded power information corresponding to the local echo reference 220A. In some such examples, instead of transmitting a full-fidelity replica of the audio data played back on audio device 110A, the control system 60a may be configured to determine the power level in each of a plurality of frequency bands of the audio data played back on audio device 110A and to transmit the corresponding banded power information to audio devices 110B and 110C. In some such examples, echo reference 220A’ may be, or may include, the banded power information.
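A minimal sketch of computing banded power for one frame of playback audio follows. It assumes a plain DFT and band edges given as bin indices; the function name `banded_power`, the frame length, and the band layout are hypothetical, since the disclosure does not specify a transform or a banding scheme.

```python
import cmath
import math


def banded_power(frame, band_edges):
    """Return the mean power of `frame` in each frequency band.

    frame:      list of float samples (one block of playback audio).
    band_edges: strictly increasing DFT bin indices; band i covers
                bins band_edges[i] .. band_edges[i + 1] - 1.
    """
    n = len(frame)
    # Naive DFT over the non-negative frequency bins (fine for short frames).
    spectrum = [
        sum(x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(frame))
        for k in range(n // 2 + 1)
    ]
    power = [abs(c) ** 2 for c in spectrum]
    bands = []
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        bins = power[lo:hi]
        bands.append(sum(bins) / len(bins))
    return bands
```

Sending a handful of band powers per frame in place of the frame itself is what would make such a reference cheap to transmit, at the price of discarding phase information.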

FIG. 3A presents a block diagram showing components of an audio device according to one example. As with other figures provided herein, the types, numbers, and arrangements of elements shown in FIG. 3A are provided only as examples. Other implementations may include more, fewer, and/or different types, numbers, and/or arrangements of elements. For example, some implementations may be configured to send and/or receive either a "raw" echo reference (which may be a complete, full-fidelity replica of the audio reproduced on an audio device) or a lower-fidelity version or representation of the audio reproduced on the audio device (such as a downsampled version, a version produced by lossy compression, or banded power information corresponding to the audio reproduced on the audio device), but not both the raw version and the lower-fidelity version.

In this example, audio device 110A is an instance of the audio device 110A of FIG. 1E and includes control system 60a, which is an instance of the control system 60 of FIG. 1A. According to this example, the control system 60a is configured to implement a renderer 201A, a multi-channel acoustic echo management system (MC-EMS) 203A, a voice processing block 240A, an echo reference orchestrator 302A, a decoder 303A, and a noise estimator 304A. The reader may assume that the MC-EMS 203A and the voice processing block 240A function as described above with reference to FIG. 2A, unless the following description of FIG. 3A indicates otherwise. In this example, network interface 301A is an instance of the interface system 55 described above with reference to FIG. 1A.

In this example, the elements of FIG. 3A are as follows:

110A: an audio device;

120A: a representative microphone. In some implementations, the audio device 110A may have more than one microphone;

121A: a representative loudspeaker. In some implementations, the audio device 110A may have more than one loudspeaker;

201A: a renderer that produces references for local playback as well as echo references that model the audio played back by other audio devices in the audio environment;

203A: a multi-channel acoustic echo management system (MC-EMS), which may include an acoustic echo canceller (AEC) and/or an acoustic echo suppressor (AES);

220A: local echo references for playback and cancellation;

221A: locally produced copies of the echo references being played by one or more non-local audio devices (one or more other audio devices in the audio environment);

223A: a plurality of microphone signals;

224A: a plurality of residual signals (the microphone signals after the MC-EMS 203A has cancelled and/or suppressed the predicted echo);

240A: a speech processing block configured for wake word detection, voice command detection, and/or providing telephony communication;

301A: a network interface configured for communication between audio devices, which may also be configured for communication via the Internet and/or via one or more cellular networks;

302A: an echo reference orchestrator configured to rank echo references and to select an appropriate set of one or more echo references;

303A: an audio decoder block;

304A: a noise estimator block;

310A: one or more decoded echo references received by the audio device 110A from one or more other devices in the audio environment;

311A: a request for one or more other devices (such as a smart home hub or one or more of the audio devices 110B-110D) to send echo references over the local network;

312A: metadata, which may be or may include metadata corresponding to an audio device layout, loudspeaker metadata, metadata corresponding to received audio data, an upmix matrix, and/or a loudspeaker activation matrix;

313A: the echo references selected by the echo reference orchestrator 302A;

314A: echo references received by the device 110A from one or more other devices;

315A: echo references sent from the device 110A to other devices;

316A: raw echo references received by the device 110A from one or more other devices of the audio environment;

317A: low-fidelity (e.g., coded) versions of echo references received by the device 110A from one or more other devices of the audio environment;

318A: an audio environment noise estimate;

350A: one or more metrics indicating the current performance of the MC-EMS 203A, which may be or may include adaptive filter coefficient data or other AEC statistics, speech-to-echo ratio (SER) data, etc.

Depending on the particular implementation, the echo reference orchestrator 302A may function in various ways. Many examples are disclosed herein. In some examples, the echo reference orchestrator 302A may be configured to make an importance estimate for each of a plurality of echo references (e.g., for the local echo references 220A and the non-local echo references 221A). Making the importance estimate may involve determining the expected contribution of each echo reference to echo mitigation by at least one echo management system of at least one audio device of the audio environment (e.g., echo mitigation by the MC-EMS 203A of the audio device 110A).

Some examples of making the importance estimate may involve determining an importance metric. In some such examples, the importance metric may be based at least in part on one or more characteristics of each echo reference, such as level, uniqueness, temporal persistence, audibility, or one or more combinations thereof. In some examples, the importance metric may be based at least in part on metadata (e.g., the metadata 312A), such as metadata corresponding to an audio device layout, loudspeaker metadata, metadata corresponding to received audio data, an upmix matrix, a loudspeaker activation matrix, or one or more combinations thereof. In some examples, the importance metric may be based at least in part on a current listening objective, a current ambient noise estimate, an estimate of the current performance of at least one echo management system, or one or more combinations thereof.

According to some examples, the echo reference orchestrator 302A may be configured to select a set of one or more echo references based at least in part on a cost determination. In some examples, the echo reference orchestrator 302A may be configured to make the cost determination, whereas in other examples another block of the control system 60a may be configured to make the cost determination. In some instances, the cost determination may involve determining a cost for at least one echo reference of the plurality of echo references, or in some cases for each of the plurality of echo references. In some examples, the cost determination may be based on the network bandwidth required to transmit an echo reference, the encoding computational requirement for encoding at least one echo reference, the decoding computational requirement for decoding at least one echo reference, the downsampling cost of producing a downsampled version of an echo reference, the echo management system computational requirement for use of at least one echo reference by the echo management system, or one or more combinations thereof.

According to some examples, the cost determination may be based on a replica of at least one echo reference in the time domain or the frequency domain, a downsampled version of at least one echo reference, a lossy compression of at least one echo reference, segmented power information for at least one echo reference, or one or more combinations thereof. In some instances, the cost determination may be based on a method in which relatively more important echo references are compressed less than relatively less important echo references. In some implementations, the echo reference orchestrator 302A (or another block of the control system 60a) may be configured to determine a current echo management system performance level (e.g., based at least in part on the metric(s) 350A). In some such examples, selecting the one or more selected echo references may be based at least in part on the current echo management system performance level.

Depending on the distributed audio device system, its configuration, the type of audio session (e.g., communication or listening to music), and/or the nature of the rendered content, the rate at which the importance of each echo reference is estimated and the rate at which the set of echo references is evaluated may differ. Moreover, the rate at which importance is estimated need not equal the rate at which the echo reference selection process makes decisions. If the two are not synchronized, then in some examples the importance calculation would be the more frequent of the two. In some instances, echo reference selection may be a discrete process in which a binary decision is made to include or not to include a particular echo reference.

FIGS. 3B and 3C are graphs showing examples of expected echo management performance versus the number of echo references used for echo management. FIG. 3B shows that the expected echo management performance improves as additional references are added. However, in this example there are only a few discrete points at which the system can operate. In some examples, the points shown in FIG. 3B may correspond to processing a complete, full-fidelity replica of each echo reference. For example, point 301 may correspond to an instance of processing the local echo reference (e.g., the local reference 220A of FIG. 2A or FIG. 3A), and point 310 may correspond to an instance of receiving a complete replica of a first non-local echo reference (e.g., a full-fidelity version of one of the received echo references 314A of FIG. 3A, which may have been selected as the most important non-local echo reference) and processing both the local echo reference and the complete replica of the first non-local echo reference.

FIG. 3C illustrates one example of operating between any two of the discrete operating points shown in FIG. 3B. The lines connecting the points of FIG. 3B may, for example, correspond to a range of echo reference fidelities, including lower-fidelity versions or representations of each echo reference. For example, points 303, 305, and 307 may correspond to copies or representations of the first non-local echo reference at increasing levels of fidelity, with point 303 corresponding to the lowest-fidelity representation and point 307 corresponding to the highest-fidelity representation other than a full-fidelity replica. In some examples, point 303 may correspond to segmented power information for the first non-local echo reference. According to some examples, points 305 and 307 may correspond to relatively more lossy compression and relatively less lossy compression of the first non-local echo reference, respectively.

The fidelity of a copy or representation of an echo reference generally increases with the number of bits required for that copy or representation. Accordingly, the fidelity of a copy or representation of an echo reference indicates the trade-off between network cost (because the number of bits that must be transmitted differs) and expected echo management performance (because performance should improve as fidelity increases). Note that the straight lines connecting the points in FIG. 3C represent only one of many possible trajectories, in part because the incremental change from one echo reference to the next depends on which echo reference is selected as the next one, and in part because the relationship between expected echo management performance and fidelity may not be linear.
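To make the network side of this trade-off concrete, the following sketch compares approximate bit rates for several representations of a single echo reference channel. The sample rates, bit depth, codec rates, segment count, and frame rate are illustrative assumptions only; none of these values comes from this disclosure.

```python
"""Approximate network cost of several echo reference representations."""


def pcm_bitrate(sample_rate_hz: int, bits_per_sample: int) -> int:
    """Bit rate (bits/s) of one uncompressed PCM channel."""
    return sample_rate_hz * bits_per_sample


def segmented_power_bitrate(num_segments: int, frames_per_second: int,
                            bits_per_value: int) -> int:
    """Bit rate (bits/s) when one power value per segment is sent each frame."""
    return num_segments * frames_per_second * bits_per_value


# Every number below is an assumed, illustrative value.
representations = {
    "raw_48k_16bit": pcm_bitrate(48_000, 16),
    "downsampled_16k_16bit": pcm_bitrate(16_000, 16),
    "lossy_less_compressed": 128_000,  # hypothetical codec setting
    "lossy_more_compressed": 32_000,   # hypothetical codec setting
    "segmented_power": segmented_power_bitrate(20, 50, 8),
}

# List the representations from cheapest to most expensive to transmit.
for name, bps in sorted(representations.items(), key=lambda kv: kv[1]):
    print(f"{name:>24}: {bps / 1000:.1f} kbit/s")
```

Under these assumptions the representations span roughly two orders of magnitude in bit rate, which is the range over which the intermediate points of FIG. 3C would be spread.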

FIG. 4 is a block diagram showing components of an echo reference orchestrator according to one example. As with the other figures provided herein, the types, numbers, and arrangements of elements shown in FIG. 4 are provided merely as examples. Other implementations may include more, fewer, and/or different types, numbers, and/or arrangements of elements. For example, some implementations may be configured to send and/or receive either a "raw" echo reference (which may be a full-fidelity replica of the audio reproduced on an audio device) or a low-fidelity version or representation of the audio reproduced on the audio device (such as a downsampled version, a version produced by lossy compression, or segmented power information corresponding to the audio reproduced on the audio device), but not to send and/or receive the raw version and the low-fidelity version at the same time.

In this example, the echo reference orchestrator 302A is an instance of the echo reference orchestrator 302A of FIG. 3A and is implemented by an instance of the control system 60a of FIG. 3A. According to this example, the elements of FIG. 4 are as follows:

221A: locally produced copies of non-local echo references being played by another audio device of the audio environment;

302A: the echo reference orchestrator, a module configured to rank echo references and to select a set of one or more echo references;

311A: a request for one or more other devices of the audio environment to send echo references over the local network;

313A: the set of one or more echo references that, in this example, is selected by the echo reference orchestrator 302A and sent to the MC-EMS 203A;

318A: an audio environment noise estimate;

401A: an echo reference importance estimator configured to estimate the expected importance of each echo reference and, in this example, to generate a corresponding importance metric 420A;

402: an echo reference selector which, in this example, is configured to select the echo reference set 313A based at least in part on the current listening objective (indicated by 421A), the cost of each echo reference (indicated by 422A), the current state/performance of the EMS (indicated by 350A), and the estimated importance of each candidate echo reference (indicated by the importance metric 420A);

403A: a cost estimation module configured to determine the cost(s) (e.g., computational and/or network costs) of including an echo reference in the echo reference set 313A;

404A: an optional module for determining or estimating the current listening objective of the audio device 110A;

405A: a module configured to implement one or more MC-EMS performance models, which in some examples may produce data such as those shown in FIG. 3B or FIG. 3C;

420A: the importance metric generated by the echo reference importance estimator 401A;

421A: information indicating the current listening objective;

422A: information indicating the cost(s) of including an echo reference in the echo reference set 313A; and

423A: information produced by the MC-EMS performance model 405A, which in some examples may be or may include data such as those shown in FIG. 3B or FIG. 3C.

Depending on the particular implementation, the echo reference importance estimator 401A may function in various ways. Various examples are provided in this disclosure. In some examples, the echo reference importance estimator 401A may be configured to make an importance estimate for each of a plurality of echo references (e.g., for the local echo references 220A and the non-local echo references 221A). Making the importance estimate may involve determining the expected contribution of each echo reference to echo mitigation by at least one echo management system of at least one audio device of the audio environment (e.g., echo mitigation by the MC-EMS 203A of the audio device 110A).

In this example, making the importance estimate involves determining the importance metric 420A. The importance metric 420A may be based at least in part on one or more characteristics of each echo reference, such as level, uniqueness, temporal persistence, audibility, or one or more combinations thereof. In some examples, the importance metric may be based at least in part on metadata (e.g., the metadata 312A), which may include metadata corresponding to an audio device layout, loudspeaker metadata (e.g., sound pressure level (SPL) rating, frequency range, whether the loudspeaker is an upward-firing loudspeaker, etc.), metadata corresponding to received audio data (e.g., positional metadata, metadata indicating vocals or other speech, etc.), an upmix matrix, a loudspeaker activation matrix, or one or more combinations thereof. In some instances, as suggested by the dashed arrow 420A, the echo reference importance estimator 401A may provide the importance metric 420A to the MC-EMS performance model 405A.

According to this example, the importance metric 420A is based at least in part on the current listening objective, as indicated by the information 421A. As described in more detail below, the current listening objective can significantly change how factors such as level, uniqueness, temporal persistence, and audibility are evaluated. For example, the importance analysis during a telephone call may be quite different from the analysis while waiting for a wake word.

In this example, the importance metric 420A is based at least in part on the current ambient noise estimate 318A, the metric(s) 350A indicating the current performance of the MC-EMS 203A, the information 423A produced by the MC-EMS performance model 405A, or one or more combinations thereof. In some implementations, the echo reference importance estimator 401A may determine that, if the room noise level is relatively high (as indicated by the current ambient noise estimate 318A), adding an echo reference is unlikely to improve echo mitigation significantly. As noted above, the information 423A may correspond to the type of information described above with reference to FIGS. 3B and 3C, which can provide a direct correlation between the use of echo references and the expected increase in performance of the MC-EMS 203A. As described in more detail below, the performance of the EMS may depend in part on the robustness of the EMS when it is perturbed by noise in the audio environment.

According to this implementation, the echo reference selector 402 selects a set of one or more echo references based at least in part on the following: the one or more metrics 350A indicating the current performance of the MC-EMS 203A, the importance metric 420A, the current listening objective 421A, the information 422A indicating the cost(s) of including an echo reference in the echo reference set 313A, and the information 423A produced by the MC-EMS performance model 405A. Some detailed examples of how the echo reference selector 402 may select echo references are provided below.
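As a simplified illustration of this kind of selection, the sketch below ranks candidate echo references by importance per unit cost and includes them greedily until a cost budget is exhausted. The greedy rule, the budget, the `Candidate` type, and all numeric values are assumptions made for illustration; the sketch ignores the listening objective, the EMS performance metrics, and the performance model, and it is not presented as the selection method of this disclosure.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    name: str
    importance: float  # expected contribution to echo mitigation
    cost: float        # combined cost of including this reference (> 0)


def select_references(candidates: List[Candidate], budget: float) -> List[str]:
    """Greedily include references by importance-per-cost within the budget."""
    if any(c.cost <= 0 for c in candidates):
        raise ValueError("every candidate needs a positive cost")
    ranked = sorted(candidates, key=lambda c: c.importance / c.cost, reverse=True)
    selected, spent = [], 0.0
    for cand in ranked:
        # Binary decision per reference: include it only if it still fits.
        if spent + cand.cost <= budget:
            selected.append(cand.name)
            spent += cand.cost
    return selected


# Hypothetical candidates for device 110A.
candidates = [
    Candidate("local_220A", importance=1.0, cost=0.4),
    Candidate("device_110B", importance=0.6, cost=1.2),
    Candidate("device_110C", importance=0.5, cost=0.7),
    Candidate("device_110D", importance=0.1, cost=1.0),
]
print(select_references(candidates, budget=2.0))
```

A greedy ratio rule is only one possible heuristic; an implementation could equally solve the selection as a small knapsack problem or weight the ranking by the current listening objective.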

In this example, the cost estimation module 403A is configured to determine the computational and/or network cost of including an echo reference in the echo reference set 313A. The computational cost may, for example, include the additional computational cost of using a particular echo reference in the MC-EMS 203A. That computational cost may, in turn, depend on the number of bits required to represent the echo reference. In some examples, the computational cost may include the computational cost of a lossy echo reference encoding process and/or the computational cost of the corresponding echo reference decoding process. Determining the network cost may involve determining the amount of data required to send a complete replica of an echo reference, or a copy or representation of the echo reference, across a local data network (e.g., a local wireless data network).
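The following sketch shows one way the network and computational components described above could be combined into a single scalar cost. The `ReferenceCost` fields, the units, and the weights are hypothetical and would need to be chosen for a particular system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReferenceCost:
    """Per-reference cost components (all units are assumed)."""
    network_bps: float  # bits per second sent over the local network
    encode_load: float  # encoder load on the sending device
    decode_load: float  # decoder load on the receiving device
    ems_load: float     # additional echo management load on the receiver


def total_cost(cost: ReferenceCost, w_network: float = 1e-6,
               w_compute: float = 0.01) -> float:
    """Weighted sum of network and compute costs; weights are tuning assumptions."""
    compute = cost.encode_load + cost.decode_load + cost.ems_load
    return w_network * cost.network_bps + w_compute * compute


# A raw reference costs bandwidth but no codec work; a coded one is the reverse.
raw = ReferenceCost(network_bps=768_000, encode_load=0, decode_load=0, ems_load=40)
coded = ReferenceCost(network_bps=64_000, encode_load=15, decode_load=5, ems_load=40)
print(total_cost(raw), total_cost(coded))
```

With these assumed weights the coded reference is cheaper overall, but a system with a fast network and constrained processors could reasonably weight the terms the other way.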

In some instances, the echo reference selection block 402A may generate and transmit a request 311A for another device in the audio environment to send it one or more echo references over the network. (Element 314A of FIG. 3A indicates one or more echo references received by the audio device 110A, which in some instances may have been sent in response to the request 311A.) In some examples, the request 311A may specify the fidelity of the requested echo reference, e.g., whether a "raw" copy (a full-fidelity replica) of the echo reference should be sent, whether an encoded version of the echo reference should be sent, whether relatively more or relatively less lossy compression should be applied if an encoded version is to be sent, whether segmented power information corresponding to the echo reference should be sent, etc.
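A request that specifies fidelity could be represented along the following lines. The `EchoReferenceRequest` structure, the field names, the JSON encoding, and the importance thresholds are all hypothetical; the only idea taken from the text is that more important references may be requested at higher fidelity (that is, with less compression).

```python
import json
from dataclasses import asdict, dataclass
from enum import Enum


class Fidelity(str, Enum):
    RAW = "raw"                          # full-fidelity replica
    LOSSY_LIGHT = "lossy_light"          # relatively less lossy compression
    LOSSY_HEAVY = "lossy_heavy"          # relatively more lossy compression
    SEGMENTED_POWER = "segmented_power"  # power information only


@dataclass(frozen=True)
class EchoReferenceRequest:
    requester: str      # device asking for the reference, e.g. "110A"
    source_device: str  # device asked to send the reference
    channel: int
    fidelity: Fidelity


def fidelity_for_importance(importance: float) -> Fidelity:
    """Map importance in [0, 1] to a fidelity; thresholds are assumptions."""
    if importance >= 0.75:
        return Fidelity.RAW
    if importance >= 0.5:
        return Fidelity.LOSSY_LIGHT
    if importance >= 0.25:
        return Fidelity.LOSSY_HEAVY
    return Fidelity.SEGMENTED_POWER


def to_json(request: EchoReferenceRequest) -> str:
    """Serialize the request; the wire format here is purely illustrative."""
    payload = asdict(request)
    payload["fidelity"] = request.fidelity.value
    return json.dumps(payload, sort_keys=True)


request = EchoReferenceRequest("110A", "110B", 0, fidelity_for_importance(0.6))
print(to_json(request))
```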

One may note that a request for an encoded echo reference not only introduces network cost, because the request and the reference must be sent, but also adds the computational cost of encoding the reference on the responding device(s) (e.g., the smart home hub 105 or one or more of the audio devices 110B-110D) and the computational cost of decoding the received reference on the audio device 110A. However, the encoding cost may be a one-time cost. Accordingly, a request sent over the network from one audio device to another for an encoded reference changes the potential performance/cost trade-offs made in other devices (e.g., in audio devices 402C and 402D).

In some implementations, one or more blocks of the echo reference orchestrator 302A may be executed by an orchestrating device (e.g., the smart home hub 105 or one of the audio devices 110A-110D). According to some such implementations, at least some of the functionality of the echo reference importance estimator 401A and/or the echo reference selection block 402A may be performed by the orchestrating device. Some such implementations may be able to determine cost/benefit trade-offs for the entire system, taking into account the performance enhancement of all instances of the MC-EMS in the audio environment, the overall computational demand of all instances of the MC-EMS, the overall demand on the local network, and/or the overall computational demand of all encoders and decoders.

Examples of various metrics and components

Importance metric

Briefly, the importance metric (which may be referred to herein as "importance" or "I") may be a measure of the expected improvement in EMS performance that results from including a particular echo reference. In some embodiments, importance may depend on the current state of the EMS, in particular on the set of echo references already in use and the fidelity level at which they are being received. Depending on the particular implementation, importance may be obtained on different time scales. At one extreme, importance may be computed frame by frame (e.g., as an importance signal with a value for each frame). In other examples, importance may be a constant value for the duration of a piece of content, or a constant value for the time during which a particular configuration of audio devices is in use. The configuration of the audio devices may correspond to audio device locations and/or audio device orientations.

Accordingly, depending on the particular implementation, the importance metric may be calculated on various time scales, for example:

·in real time, e.g., by analyzing the current audio content in response to events in the audio environment (such as an incoming call);

·on a longer time scale, e.g., track by track, where a track corresponds to a piece of content such as a song or another piece of musical content that may last, for example, on the order of minutes; or

·only once, e.g., when the audio system is initially configured or is reconfigured.

Decisions about which echo references to select for echo management purposes may be made on a time scale similar to (or slower than) the time scale on which the importance metric is evaluated. For example, a device or system might estimate importance every 30 seconds and decide whether to change the selected echo references every few minutes.
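The two rates can be decoupled as in the sketch below, where importance is estimated on every tick and a binary include/exclude decision is made only at the slower decision period, using the mean of the estimates collected since the previous decision. The averaging rule, the threshold, and the period values are assumptions for illustration.

```python
from typing import List, Sequence, Tuple


def decide_inclusion(importance_stream: Sequence[float],
                     estimate_period_s: int = 30,
                     decide_period_s: int = 120,
                     threshold: float = 0.5) -> List[Tuple[int, bool]]:
    """Return (time_s, include) decisions for one candidate echo reference.

    One importance value arrives per estimate period; a decision is made
    whenever the elapsed time is a multiple of the decision period.
    """
    decisions = []
    window = []
    for tick, value in enumerate(importance_stream):
        window.append(value)
        elapsed = (tick + 1) * estimate_period_s
        if elapsed % decide_period_s == 0:
            mean_importance = sum(window) / len(window)
            decisions.append((elapsed, mean_importance >= threshold))
            window.clear()
    return decisions


# Eight estimates at 30 s intervals give two decisions at 120 s intervals.
stream = [0.2, 0.3, 0.4, 0.3, 0.6, 0.7, 0.8, 0.7]
print(decide_inclusion(stream))
```

Averaging over the decision window is one simple way to keep the selected set from changing on every short-lived fluctuation in estimated importance.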

According to some examples, the control system may be configured to determine an importance matrix, which may include all of the importance information for the current audio device system. In some such examples, the importance matrix may have dimensions N×M, with an entry for each audio device and for each potential echo reference channel. In some such examples, N represents the number of audio devices and M represents the number of potential echo references. Because some audio devices may play back more than one channel, this type of importance matrix is not always square.
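A minimal sketch of such a matrix follows, assuming three audio devices of which one plays back two channels, so that N = 3 and M = 4. The device labels reuse reference numerals from the figures, but the channel naming and every importance value are invented for illustration.

```python
from typing import List

# Rows: audio devices (N = 3). Columns: potential echo references (M = 4).
devices = ["110A", "110B", "110C"]
references = ["110A/ch0", "110B/ch0", "110B/ch1", "110C/ch0"]

# importance[n][m]: importance of reference m to the EMS of device n.
importance = [
    [0.9, 0.4, 0.3, 0.1],  # device 110A
    [0.3, 0.9, 0.8, 0.2],  # device 110B (plays back two channels)
    [0.1, 0.3, 0.2, 0.9],  # device 110C
]


def ranked_references(device: str) -> List[str]:
    """Echo references for one device, from most to least important."""
    row = importance[devices.index(device)]
    order = sorted(range(len(row)), key=lambda m: row[m], reverse=True)
    return [references[m] for m in order]


print(ranked_references("110A"))
```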

In some implementations, the importance metric I may be based on one or more of the following:

·L: the level of the echo reference;

·U: the uniqueness of the echo reference;

·P: the temporal persistence of the echo reference; and/or

·A: the audibility of the device rendering the echo reference.

As used herein, the acronym "LUPA" refers generally to the echo reference characteristics from which an importance metric may be determined, including but not limited to one or more of L, U, P, and/or A.

L, or the "level" aspect

This aspect describes the level or loudness of the echo reference. Other things being equal, it is well known that the louder the playback signal, the greater its impact on EMS performance. As used herein, the term "level" refers to the level within the digital representation of the audio signal and does not necessarily refer to the actual sound pressure level of the audio signal after it has been reproduced by a loudspeaker. In some examples, the loudness of a single channel of an echo reference may be based on a root mean square (RMS) measure or on an LKFS (loudness, K-weighted, relative to full scale) measure. Such measures are easily computed on the echo reference in real time, or may be present as metadata in the bitstream. According to some implementations, L may be determined from a volume setting, such as an audio system volume setting or a volume setting within a media application.
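As an illustration of the RMS option, the sketch below computes the RMS level of a block of samples in dB relative to full scale, assuming floating-point samples normalized to the range [-1, 1]. The function name and the silence floor are assumptions. An LKFS measure would additionally require the K-weighting filter and gating, which are not implemented here.

```python
import math
from typing import Sequence


def rms_dbfs(samples: Sequence[float], floor_db: float = -120.0) -> float:
    """RMS level in dB relative to a full-scale amplitude of 1.0.

    Returns floor_db for empty or silent input so callers never see -inf.
    """
    if not samples:
        return floor_db
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square <= 0.0:
        return floor_db
    return max(floor_db, 10.0 * math.log10(mean_square))


# A block at half of full scale sits roughly 6 dB below full scale.
print(round(rms_dbfs([0.5, -0.5, 0.5, -0.5]), 2))
```

Because the level is measured on the digital signal, two devices with the same value of L may still differ greatly in acoustic output, which is one reason audibility is treated as a separate aspect.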

U, or the "uniqueness" aspect

The uniqueness aspect is intended to capture how much new information about the overall audio presentation a particular echo reference provides. From a statistical point of view, multi-channel audio presentations often contain redundancy across channels. For example, this redundancy may arise because instruments and other sound sources are duplicated in the channels on the left and right sides of the room, or because signals are panned and therefore replicated further in several active loudspeakers at the same time. Although such scenarios leave the EMS with an over-specified problem to solve (in which the echo filters may infer observations from multiple echo paths), some benefit and higher performance can still be observed in practice.

U may be calculated or estimated in various ways. In some examples, U may be based at least in part on the correlation coefficients between the echo references. In one such example, U may be estimated as follows:

where the subscript "r" corresponds to the particular echo reference being evaluated, N represents the total number of audio devices in the audio environment, n represents an individual audio device, M represents the total number of potential echo references in the audio environment, and m represents an individual echo reference.
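The sketch below illustrates one way a correlation-based uniqueness score could be computed. The specific formula used here, one minus the mean absolute Pearson correlation between reference r and every other reference, is an assumption chosen for illustration and is not necessarily the expression used in this disclosure. For simplicity the sketch also treats all references as a single flat list rather than indexing them per audio device.

```python
import math
from typing import Sequence


def correlation(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient; 0.0 if either signal is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def uniqueness(refs: Sequence[Sequence[float]], r: int) -> float:
    """Assumed score: 1 - mean |correlation| of reference r with the others."""
    others = [m for m in range(len(refs)) if m != r]
    if not others:
        return 1.0  # a lone reference is trivially unique
    mean_abs = sum(abs(correlation(refs[r], refs[m])) for m in others) / len(others)
    return 1.0 - mean_abs


# refs[1] is a scaled copy of refs[0]; refs[2] is uncorrelated with both.
refs = [[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0], [1.0, -1.0, -1.0, 1.0]]
print([round(uniqueness(refs, r), 3) for r in range(len(refs))])
```

In this toy data the scaled copy halves the score of the two redundant references, while the uncorrelated reference keeps the maximum score.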

Alternatively or additionally, in some examples U may be based at least in part on decomposing the audio signals to find redundancy. Some such examples may involve instantaneous frequency estimation, fundamental frequency (F0) estimation, spectrogram inversion, and/or non-negative matrix factorization (NMF).

According to some examples, U may be based at least in part on data used for matrix decoding. Matrix decoding is an audio technique in which a small number of discrete audio channels (e.g., 2) are decoded into a larger number of channels (e.g., 4 or 5) on playback. The channels are typically arranged by an encoder for transmission or recording and are decoded by a decoder for playback. Matrix decoding allows multi-channel audio (such as surround sound) to be encoded in a stereo signal, to be played back as stereo on stereo equipment, and to be played back as surround sound on surround equipment. In one such example, if a Dolby 5.1 system is receiving a stereo audio data stream, a static up-mixing matrix may be applied to the stereo audio data in order to provide properly rendered audio for each loudspeaker in the Dolby 5.1 system. According to some examples, U may be based at least in part on the coefficients of the up-mixing or down-mixing matrix that is used to distribute audio to each loudspeaker of the audio environment (e.g., to each of audio devices 110A-110D).

In some examples, U may be based at least in part on a standard canonical loudspeaker layout used in the audio environment (e.g., Dolby 5.1, Dolby 7.1, etc.). Some such examples may exploit the way media content is traditionally mixed and presented in such canonical loudspeaker layouts. For example, in a Dolby 5.1 or Dolby 7.1 system, artists typically place vocals in the center channel rather than in the surround channels. As noted above, audio corresponding to instruments and other sound sources is often replicated on channels on the left and right sides of the room. In some instances, vocals, dialogue, instrumental music, etc. may be identified via metadata received with the corresponding audio data.

P or "persistence" aspect

The persistence metric is intended to capture the fact that different types of playback media may have widely varying temporal persistence, with different types of content having different degrees of silence and loudspeaker activation. A spectrally dense, continuous stream of content (such as music or the audio output of a video game console) may have a high level of temporal persistence, whereas a podcast may have a lower level of temporal persistence. Infrequent system notifications have a very low level of temporal persistence. Depending on the particular listening task at hand, echo references corresponding to media with a lower degree of persistence may be less important to the EMS. For example, an occasional system notification is unlikely to collide with a wake word or a barge-in request, so managing that echo is of relatively low importance.

The following are examples of metrics that may be used to measure or estimate persistence:

·the percentage of time in a recent history window during which the playback signal was above a particular digital loudness threshold;

·a metadata tag or media classification indicating that the content corresponds to music, broadcast content, a podcast, or a system sound; and/or

·the percentage of time in a recent history window during which the playback signal was in a frequency range typical of the human voice (e.g., 100 Hz to 3 kHz).
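The first metric in the list above can be sketched as a simple activity ratio. The function name `persistence`, the per-frame level input, and the default threshold of -50 dB are assumptions made for illustration.

```python
def persistence(frame_levels_db, threshold_db=-50.0):
    """Fraction of frames in a recent history window whose level
    exceeds `threshold_db` (an assumed digital loudness threshold).

    `frame_levels_db` holds one level per frame, in dB relative to
    full scale, for the frames in the window.
    """
    if not frame_levels_db:
        return 0.0
    active = sum(1 for level in frame_levels_db if level > threshold_db)
    return active / len(frame_levels_db)
```

A continuous music stream would yield a value near 1, while an occasional notification sound would yield a value near 0.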

According to some examples, the type of audio content may affect the estimation of L, U and/or P. For example, knowing that the audio content is stereo music would allow all of the echo references to be ranked using only the channel assignments described above. Alternatively, if the control system does not analyze the audio content but instead relies on channel assignments, knowing that the audio content is Atmos may change the default assumptions about L, U and/or P.

A or "audibility" aspect

The audibility metric addresses the fact that audio devices have different playback characteristics and that, in any given audio environment, the distances between audio devices may differ. The following are examples of metrics that may be used to measure or estimate the audibility of an audio device:

·a direct measurement of the audibility of the audio device;

·a reference to a data structure that includes characteristics of one or more loudspeakers of the audio device, such as rated SPL, frequency response, and directivity (e.g., whether the loudspeaker is omnidirectional, front-firing, upward-firing, etc.);

·an estimate based on the distance to the audio device; and/or

·any combination of the above.

Other factors may be evaluated in order to estimate importance and, in some instances, to determine an importance metric.

Listening objective

The listening objective may define the context and the desired performance characteristics of the EMS. In some examples, the listening objective may modify the parameters and/or the domain over which LUPA is evaluated. The following discussion considers three potential contexts in which the listening objective changes. In these different contexts, we will see how probability and criticality can affect LUPA.

1. Barge-in (e.g., detecting an instance of a wake word)

While waiting for a barge-in, there is no immediate urgency: it is generally assumed that the user is equally likely to speak the wake word in every future time interval. In addition, the wake word detector is likely to be the most robust element of a voice assistant, so the impact of echo leakage is less critical.

2. Command

Immediately after a person has spoken the wake word, the likelihood that the person will speak a command is very high. The probability of a collision with echo in the near future is therefore high. In addition, because the command recognition module may be less robust than the wake word detector, the criticality of echo leakage is typically high.

3. Communication

During a voice call, it is certain that the participants (the person or people in the audio environment and the person or people at the far end) will talk to one another. In other words, the probability that echo will collide with user speech is essentially 1. However, because the person or people at the far end are human and cope well with background noise, criticality is low: they are unlikely to be troubled by echo leakage.

In the context of these different listening objectives, the way LUPA is evaluated may change in some examples.

1. Barge-in

There may be no temporal differentiation, because the probability of the wake word being spoken is assumed to be the same in all future time intervals. The time horizon over which the control system evaluates LUPA may therefore be fairly long, in order to obtain better estimates of these parameters. In some such examples, the time interval over which the control system evaluates LUPA may be set to look relatively far into the future (e.g., over a time horizon of several minutes).

2. Command

A command is very likely to be spoken in the time interval immediately after the wake word is spoken. Therefore, after a wake word has been detected, in some embodiments LUPA may be evaluated on a much shorter time scale (e.g., on the order of seconds) than in the barge-in context. In some examples, because the likelihood of a collision is high during this time interval, references that are sparse in time but have content playing in the few seconds after wake word detection will be considered more important.
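The dependence of the evaluation horizon on the listening objective can be sketched as a lookup. The enum, the function name, and the window lengths (180 s and 3 s) are hypothetical; the text only says "several minutes" and "on the order of seconds".

```python
from enum import Enum


class ListeningObjective(Enum):
    BARGE_IN = "barge_in"
    COMMAND = "command"


# Illustrative horizons in seconds; the specific values are assumptions.
EVALUATION_WINDOW_S = {
    ListeningObjective.BARGE_IN: 180.0,  # long horizon: uniform probability
    ListeningObjective.COMMAND: 3.0,     # short horizon: collision is imminent
}


def evaluation_window_s(objective):
    """Return the assumed time horizon over which LUPA is evaluated."""
    return EVALUATION_WINDOW_S[objective]
```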

FIG. 5A is a flow diagram that outlines one example of a disclosed method. As with other methods described herein, the blocks of method 500 are not necessarily performed in the order indicated. In some examples, one or more blocks may be performed concurrently. Moreover, such methods may include more or fewer blocks than shown and/or described. For example, some embodiments may not include block 501.

In this example, method 500 is an echo reference selection method. The blocks of method 500 may be performed, for example, by a control system (such as the control system 60a of FIG. 2A or FIG. 3A). In some examples, the blocks of method 500 may be performed by an echo reference selector module (such as the echo reference selector 402A described above with reference to FIG. 4).

The reference selection method of FIG. 5A is an example of what may be referred to herein as a "greedy" echo reference selection method. It involves evaluating cost and expected performance gain only at the current operating point of the MC-EMS (in other words, at the number of references the MC-EMS is currently using, including the echo references already selected) and evaluating the result of adding each additional echo reference, for example in descending order of importance. Accordingly, this example involves a process of determining whether to add a new echo reference. In some embodiments, the echo references evaluated in method 500 may already have been ranked according to estimated importance (e.g., by the echo reference importance estimator 401A). Solutions that are closer to optimal in terms of cost and performance may exist if more sophisticated techniques, such as tree search methods, are employed. Alternative examples may involve other search and/or optimization routines, including brute-force methods. Some alternative embodiments may involve determining whether to drop or discard a previously selected echo reference.

In this example, block 501 involves determining whether the current performance level of the EMS is greater than or equal to a desired performance level. If so, the process terminates (block 510). However, if it is determined that the current performance level is below the desired performance level, in this example the process continues to block 502. According to this example, the determination of block 501 is based at least in part on one or more metrics indicating the current performance of the EMS, such as adaptive filter coefficient data or other AEC statistics, speech-to-echo ratio (SER) data, etc. In some examples in which the determination of block 501 is made by the echo reference orchestrator 302A, the determination may be based at least in part on one or more metrics 350A from the MC-EMS 203A. As noted above, some embodiments may not include block 501.

According to this example, block 502 involves ranking the remaining unselected echo references by importance and estimating the potential EMS performance gain that would be obtained by including the most important echo reference not yet used by the EMS. In some examples in which the process of block 502 is performed by the echo reference orchestrator 302A, the process may be based at least in part on information 423A produced by the MC-EMS performance model 405A, which in some examples may be or include data such as that shown in FIG. 3B or FIG. 3C. In some embodiments, the ranking and prediction process may be performed at an earlier stage of method 500, for example when a previous echo reference was evaluated. In some examples, the ranking and prediction process may be performed before method 500 is performed. In some embodiments in which the ranking and prediction process has already been performed, block 502 may simply involve selecting the highest-ranked unselected echo reference determined by that earlier process.

In this example, block 503 involves comparing the performance and the cost of adding the echo reference selected in block 502. In some examples in which the process of block 503 is performed by the echo reference orchestrator 302A, block 503 may be based at least in part on information 422A from the cost estimation module 403A indicating the cost(s) of including the echo reference in the echo reference set 313A.

Because performance and cost may be variables with different ranges and/or domains, comparing them directly can be challenging. Therefore, in some embodiments the evaluation of block 503 may be facilitated by mapping the performance and cost variables onto a similar scale, such as a range between predefined minimum and maximum values.

In some embodiments, the cost of adding the echo reference under evaluation may simply be set to zero if adding it would not cause a predetermined budget for network bandwidth and/or computational cost to be exceeded. In some such examples, the cost of adding the echo reference under evaluation may be set to infinity if adding it would cause the predetermined budget for network bandwidth and/or computational cost to be exceeded. Such examples have the benefit of being simple and efficient: the control system simply adds the maximum number of echo references that the predetermined network bandwidth and/or computational cost budget allows.

According to some examples, if the estimated performance gain corresponding to adding an echo reference is not above a predetermined threshold (e.g., 1%, 2%, 3%, 4%, 5%, 6%, 7%, 8%, 9%, 10%, etc.), the estimated performance gain may be set to zero. Such an approach can prevent network bandwidth and/or computational overhead from being consumed by including echo references that add only a negligible performance gain. Some detailed alternative examples of cost determination are described below.
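The two simplifications described above (a zero-or-infinite budget cost and a thresholded gain) can be sketched as follows. The function names, the single scalar budget, and the default 5% threshold are assumptions chosen for illustration.

```python
import math


def budget_cost(current_usage, added_usage, budget):
    """Cost is zero while the budget holds and infinite once adding the
    reference would exceed it (usage and budget share one unit)."""
    if current_usage + added_usage <= budget:
        return 0.0
    return math.inf


def thresholded_gain(estimated_gain, min_gain=0.05):
    """Treat an estimated gain that is not above `min_gain` as zero."""
    return estimated_gain if estimated_gain > min_gain else 0.0
```

With these two mappings, a "cost less than gain" test admits every reference that fits the budget and offers a non-negligible gain, and rejects all others.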

In this example, block 504 involves determining whether a new echo reference will be added, given the performance/cost evaluation of block 503. In some examples, blocks 503 and 504 may be combined into a single block. According to this example, block 504 involves determining whether the cost of adding the echo reference under evaluation would be less than the EMS performance gain that adding it is estimated to produce. In this example, if the estimated cost is not less than the estimated performance gain, the process continues to block 511 and method 500 terminates. However, in this embodiment, if the estimated cost is less than the estimated performance gain, the process continues to block 505.

According to this example, block 505 involves adding the new echo reference to the set of selected echo references. In some instances, block 505 may include notifying the renderer 202 to output the relevant echo reference. According to some examples, block 505 may involve sending the echo reference over a local network, or sending a command 311 to another device to send the echo reference over the local network.

The echo references evaluated in method 500 may be local or non-local echo references. Non-local echo references may be determined locally (e.g., by a local renderer as described above) or received over a local network. Accordingly, the cost estimate for some echo references may involve evaluating both computational cost and network cost.

According to some examples, in order to evaluate the next echo reference after block 505, the control system may simply reset the selected and unselected echo references and revert to a previous block of FIG. 5A, such as block 501, block 502 or block 503. However, more elaborate approaches may also involve evaluating the references that have already been selected, for example by ranking all of the references already selected and deciding whether to drop the echo reference with the lowest estimated importance.
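The flow of blocks 501 to 505 can be sketched as a loop over candidates ranked by importance. This is a simplified illustration rather than the disclosed implementation: the `Candidate` fields, the assumption that gain and cost are already mapped to a common scale, and the assumption that gains add linearly to the current performance are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    importance: float  # estimated importance (higher ranks first)
    gain: float        # estimated performance gain, on a common scale
    cost: float        # estimated cost, mapped to the same scale


def greedy_select(candidates, current_performance, target_performance):
    """Add references in descending order of importance until the target
    is met or a candidate's cost is not less than its gain."""
    selected = []
    performance = current_performance
    ranked = sorted(candidates, key=lambda c: c.importance, reverse=True)
    for candidate in ranked:                   # block 502: next best reference
        if performance >= target_performance:  # block 501: target already met
            break
        if candidate.cost >= candidate.gain:   # blocks 503/504: compare
            break                              # block 511: terminate
        selected.append(candidate)             # block 505: add the reference
        performance += candidate.gain          # assumed additive gain model
    return selected
```

Because the loop stops at the first candidate that fails the comparison, it never revisits earlier choices; that is the sense in which the method is greedy.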

Alternative echo reference forms

An echo reference may be transmitted (or used locally within a device, such as the device that produces all of the echo references) in a number of forms or variants, which may change the cost/benefit ratio of that particular echo reference. For example, the cost of sending an echo reference over the local network can be reduced by transforming it into banded power form (in other words, by determining the power in each of a plurality of frequency bands and transmitting banded power information about the power in each band). However, the potential improvement that an EMS can obtain from a lower-fidelity variant of an echo reference will generally also be lower. Choosing to make any particular variant of an echo reference available may be interpreted as making it a potential candidate for selection.

In some embodiments, an echo reference may be in one of the forms listed below (the first four of which are listed in descending order of estimated performance):

·a full-fidelity (original, exact) echo reference, which incurs the full computational cost and the full network cost (if transmitted over a network);

·a downsampled echo reference, whose computational and network costs are reduced in proportion to the downsampling factor, but which incurs the computational cost of the downsampling process;

·an encoded echo reference produced via a lossy encoding process, whose network cost may be reduced according to the compression ratio of the encoding scheme, but which incurs the computational costs of encoding and decoding;

·banded power information corresponding to the echo reference, whose network cost may be significantly lower because the number of bands can be far smaller than the number of sub-bands of a full-fidelity echo reference, and whose computational cost may be significantly lower because implementing a banded AES costs far less than implementing a sub-band AEC; or

·any other form that reduces fidelity in exchange for a reduction in some cost (whether computational, network, or another cost such as memory).
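The banded power form mentioned in the list above can be sketched as a sum of squared magnitudes over groups of frequency bins. The function name and the representation of band boundaries as ascending bin indices are assumptions for the example.

```python
def banded_power(bins, band_edges):
    """Collapse complex frequency bins into real band powers.

    `bins` is a list of complex bin values for one frame. `band_edges`
    is an ascending list of bin indices; band b covers the half-open
    range [band_edges[b], band_edges[b + 1]).
    """
    return [
        sum(abs(x) ** 2 for x in bins[lo:hi])
        for lo, hi in zip(band_edges, band_edges[1:])
    ]
```

The output holds one real number per band rather than one complex number per bin, which is where the reduction in network and computational cost comes from.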

FIG. 5B is a flow diagram that outlines another example of a disclosed method. As with other methods described herein, the blocks of method 550 are not necessarily performed in the order indicated. In some examples, one or more blocks may be performed concurrently. Moreover, such methods may include more or fewer blocks than shown and/or described.

The blocks of method 550 may be performed, for example, by a control system (such as the control system 60a of FIG. 2A or FIG. 3A). In some examples, the blocks of method 550 may be performed by an echo reference selector module (such as the echo reference selector 402A described above with reference to FIG. 4).

Method 550 takes into account the fact that an echo reference is not necessarily transmitted or used in full-fidelity form, but may instead be transmitted or used in one of the alternative, partial-fidelity forms described above. Therefore, in method 550 the evaluation of performance and cost does not involve a binary decision about whether the full-fidelity form of an echo reference will or will not be used. Instead, method 550 involves determining whether to include one or more lower-fidelity versions of an echo reference, which may yield a smaller EMS performance gain but at a lower cost. Methods such as method 550 provide additional flexibility in the set of potential echo references to be used by the echo management system.

In this example, method 550 is an extension of the echo reference selection method 500 described above with reference to FIG. 5A. Accordingly, blocks 501 (if included), 502, 503, 504 and 505 may be performed as described above with reference to FIG. 5A, unless stated otherwise below. Method 550 adds to method 500 a potential iterative loop that includes blocks 506 and 507. According to this example, if it is determined (here, in block 504) that the estimated cost of adding one version of an echo reference would not be less than the estimated EMS performance gain, it is determined in block 506 whether another version of the echo reference exists. In some examples, the full-fidelity version of an echo reference may be evaluated before any lower-fidelity versions (if any are available). According to this embodiment, if it is determined in block 506 that another version of the echo reference is available, another version of the echo reference (e.g., the highest-fidelity version that is not the full-fidelity version) is selected in block 507 and evaluated in block 503.

Accordingly, method 550 involves evaluating lower-fidelity versions of an echo reference, if any are available. Such lower-fidelity versions may include a downsampled version of the echo reference, an encoded version of the echo reference produced via a lossy encoding process, and/or banded power information corresponding to the echo reference.
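The inner loop that blocks 506 and 507 add can be sketched as a walk over the available versions of one echo reference, from highest to lowest fidelity. The tuple layout and function name are hypothetical, and gain and cost are assumed to be on a common scale already.

```python
def choose_version(versions):
    """Return the name of the first version whose cost is less than its
    gain, or None if no version qualifies.

    `versions` is a list of (name, gain, cost) tuples ordered from
    highest fidelity to lowest fidelity.
    """
    for name, gain, cost in versions:  # blocks 506/507: try next version
        if cost < gain:                # block 504: worth adding?
            return name
    return None
```

For example, a full-fidelity version that is too expensive to send may be rejected while its banded power version is accepted.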

Cost model

The "cost" of an echo reference refers to the resources required to use that reference for echo management, whether by an AEC or by an AES. Some disclosed embodiments may involve estimating one or more of the following types of cost:

·Computational cost, which may be determined with reference to the use of the limited amount of processing power available on one or more devices in the audio environment. Computational cost may refer to one or more of the following:

ο the cost of using the reference to perform echo management on a particular listening device. This may refer to using the reference in an AEC or in an AES. Note that an AEC operates on bins or sub-bands (which are complex-valued) and requires many more CPU operations than an AES, which operates on bands (which are fewer in number than the bins/sub-bands used by the AEC, and whose band powers are real rather than complex);

ο the cost of encoding or decoding the echo reference when a codec-compressed reference is used;

ο the cost of banding the signal (in other words, transforming the signal from a simple linear frequency-domain representation into a banded frequency-domain representation); and/or

ο the cost of producing the echo reference (e.g., by a renderer).

·Network cost, which refers to the use of a limited amount of network resources, such as the bandwidth available in the local network (e.g., a local wireless network in the audio environment) that is used to share echo references between devices.

The total cost of a particular set of echo references may be determined as the sum of the costs of each echo reference in the set. Some disclosed examples involve combining network cost and computational cost. According to some examples, the total cost C_total may be determined as follows:

In the above equation, R_comp represents the total amount of computational resources available for echo management and R_network represents the total amount of network resources available for echo management. The remaining two terms represent, respectively, the computational cost and the network cost associated with using the m-th reference (where a total of M references are used in the EMS). One may note that this definition implies

0 ≤ C_total ≤ 1,

and that C_total includes only the cost component that is closest to being limited by the resources available to the system.
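The equation itself is not reproduced in this text. One form that is consistent with the properties stated above (each cost summed over the M references, normalized by the available resource, and only the most constrained component retained) is the maximum of the two normalized sums. The sketch below assumes that form; the function and parameter names are hypothetical.

```python
def total_cost(comp_costs, network_costs, r_comp, r_network):
    """Assumed form of C_total: the larger of the normalized
    computational and network cost sums.

    `comp_costs` and `network_costs` hold one entry per reference in
    use; `r_comp` and `r_network` are the total resources available.
    """
    comp_fraction = sum(comp_costs) / r_comp
    network_fraction = sum(network_costs) / r_network
    return max(comp_fraction, network_fraction)
```

Under this assumption the result stays between 0 and 1 as long as neither resource is oversubscribed, and a value of 1 means that one of the two resources is fully used.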

Performance

The "performance" of an echo management system (EMS) may refer to the following:

·the amount of echo removed from the microphone feed, which may be measured in terms of echo return loss enhancement (ERLE). ERLE is measured in decibels and is the ratio of the send power to the power of the residual signal. This metric may be normalized, for example, according to an application-based metric such as the minimum ERLE required to support an automatic speech recognition (ASR) processor in performing a wake word detection task, in which a particular spoken keyword is detected in the presence of echo;

·the robustness of the EMS when disturbed by room noise sources, non-linearities of the local audio system, double-talk, etc.;

·the robustness of the EMS when using echo references of less than full fidelity;

·the ability of the EMS to track system changes, including the ability of the EMS to converge initially; and/or

·the ability of the EMS to track changes in the rendered audio scene. This may refer, for example, to shifts in the covariance matrix of the echo references and to the robustness of the EMS to non-stationary non-uniqueness problems.
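The ERLE measure named in the first item above can be sketched as a power ratio in decibels. The sketch takes the send power to be the mean-square power of the microphone signal before echo management, which is an assumption of this example, as are the function name and the small `eps` guard.

```python
import math


def erle_db(mic_samples, residual_samples, eps=1e-12):
    """Ratio, in dB, of the power of the signal entering the EMS to the
    power of the residual signal leaving it. `eps` guards against a
    zero-power residual."""
    mic_power = sum(s * s for s in mic_samples) / len(mic_samples)
    res_power = sum(s * s for s in residual_samples) / len(residual_samples)
    return 10.0 * math.log10((mic_power + eps) / (res_power + eps))
```

A residual whose amplitude is one tenth of the microphone signal corresponds to an ERLE of about 20 dB.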

Some examples may involve determining a single performance metric P. Some such examples use ERLE together with robustness estimated from adaptive filter coefficient data or other AEC statistics obtained from the EMS. According to some such examples, a performance robustness metric P_Rob may be determined using a "microphone probability" extracted from the AEC, for example as follows:

P_Rob = 1 - M_prob

In the above equation, 0 ≤ P_Rob ≤ 1, 0 ≤ M_prob ≤ 1, and M_prob represents the microphone probability, which is the proportion of sub-band adaptive filters in the AEC that produce poor echo predictions providing little or no echo cancellation in their respective sub-bands.
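The relationship P_Rob = 1 - M_prob can be sketched directly from the definition of M_prob as a proportion of sub-band filters. How a filter is judged to be producing a poor prediction is not specified here, so the per-filter boolean flags are an assumed input, and the function name is hypothetical.

```python
def robustness_metric(poor_filter_flags):
    """Return P_Rob = 1 - M_prob.

    `poor_filter_flags` holds one boolean per sub-band adaptive filter,
    True where the filter is judged to give a poor echo prediction.
    """
    if not poor_filter_flags:
        raise ValueError("at least one sub-band filter is required")
    m_prob = sum(poor_filter_flags) / len(poor_filter_flags)
    return 1.0 - m_prob
```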

唤醒词(WW)检测器的性能很大程度上取决于语音回声比(SER)，该语音回声比可通过EMS的ERLE按比例提高。当SER太低时，WW检测器更有可能错误触发(误报)并漏掉用户说出的关键词(漏检测)，因为回声会破坏麦克风信号并降低系统的准确度。由ASR处理器(例如，图2A的语音处理块240A)消耗的残差信号(例如，图2A的残差信号224A)的SER由EMS与EMS的ERLE成比例地提高，从而改进WW检测器的性能。The performance of the wake-up word (WW) detector depends largely on the speech echo ratio (SER), which can be improved proportionally by the ERLE of the EMS. When the SER is too low, the WW detector is more likely to trigger falsely (false alarms) and miss keywords spoken by the user (missed detections) because the echo corrupts the microphone signal and reduces the accuracy of the system. The SER of the residual signal (e.g., residual signal 224A of FIG. 2A) consumed by the ASR processor (e.g., speech processing block 240A of FIG. 2A) is improved by the EMS in proportion to the ERLE of the EMS, thereby improving the performance of the WW detector.

Accordingly, some disclosed examples involve mapping a desired WW performance level to a nominal SER level. Combined with knowledge of the typical playback levels of the devices in the system, this allows the control system to map the desired WW performance level directly to a nominal ERLE. In some examples, this approach can be extended to map the WW performance of the system at various SER levels to an ERLE. In some such implementations, input data spanning a range of SER values can be used to generate receiver operating characteristic (ROC) curves for a particular WW detector. Some examples involve selecting a particular false alarm rate (FAR) of interest and using the accuracy of the WW detector at that FAR, expressed as a function of SER, as the basis for the application. In some such examples,

$$Acc(SER_{res}) = ROC(SER_{res}, FAR_I)$$

In the above equation, $Acc(SER_{res})$ represents the accuracy of the WW detector as a function of $SER_{res}$, the SER of the residual signal output by the EMS. $ROC()$ represents a set of ROC curves for multiple SERs, and $FAR_I$ represents the false alarm rate of interest, typical values of which may be 3 per 24 hours and 1 per 10 hours. The accuracy $Acc(SER_{res})$ may be expressed as a percentage or normalized to the range 0 to 1, which may be expressed as follows:

$$0 \le Acc(SER_{res}) \le 1$$

有了音频环境中的音频设备的回放能力的知识，就可以结合使用例如实际回声水平的LUPA分量和目标音频环境中典型的语音水平来确定麦克风信号(例如，图2A的麦克风信号223A)中的典型SER值，例如如下：With knowledge of the playback capabilities of the audio devices in the audio environment, a typical SER value in a microphone signal (e.g., microphone signal 223A of FIG. 2A ) may be determined using, for example, the LUPA component of the actual echo level and the typical speech level in the target audio environment, for example, as follows:

In the above equation, $Speech_{pwr}$ and $Echo_{pwr}$ represent the expected baseline speech power level and echo power level of the target audio environment, respectively. The EMS can improve $SER_{mic}$ to $SER_{res}$ in proportion to the ERLE, for example as follows:
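The two relationships described in this passage can be written as shown below. This is a reconstruction from the surrounding definitions rather than a quotation of the original equations: it assumes that $SER_{mic}$ is a plain ratio of the two power levels, and that an improvement "in proportion to the ERLE" is additive once both quantities are expressed in decibels.

```latex
SER_{mic} = \frac{Speech_{pwr}}{Echo_{pwr}}
\qquad
SER_{res}^{dB} = SER_{mic}^{dB} + ERLE^{dB}
```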

在上述等式中，上标dB指示在该示例中变量以分贝为单位。为了完整性，一些实施方式可以将EMS的ERLE定义如下：In the above equations, the superscript dB indicates that the variable is in decibels in this example. For completeness, some embodiments may define the ERLE of the EMS as follows:

使用前述等式，一些实施方式可以定义基于WW应用的EMS性能指标，如下所示：Using the aforementioned equation, some embodiments may define the EMS performance metric based on the WW application as follows:

where the SER term represents the SER in the target environment. In some examples, this SER can be a static default number, while in other examples it can be estimated, for example, as a function of one or more LUPA components. Some implementations may involve defining the net performance metric P as a vector containing each component, for example as follows:

$$P = [P_{WW}, P_{Rob}]$$

In some examples, one or more additional performance components may be added by increasing the size of the net performance vector. In some alternative examples, the performance components may be combined into a single scalar metric by weighting them, for example as follows:

$$P = (1 - K)\,P_{WW} + K\,P_{Rob}$$

在上述等式中，K表示由系统设计者选择的加权因子，其用于确定每个分量对净性能的贡献程度。一些替代示例可以使用另一种方法，例如，只是对各个性能指标进行平均。然而，将各个性能指标组合成单个标量指标可能是有利的。In the above equation, K represents a weighting factor selected by the system designer that is used to determine how much each component contributes to the net performance. Some alternative examples may use another approach, such as simply averaging the individual performance indicators. However, it may be advantageous to combine the individual performance indicators into a single scalar indicator.
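The chain from SER and ERLE to the scalar metric P can be sketched as follows. Several parts are assumptions made for illustration and are not taken from the text: the ERLE is computed with the conventional definition (microphone power over residual power, in dB), the WW metric is taken as the detector accuracy evaluated at the target SER plus the ERLE, and the accuracy curve points are invented values standing in for a curve measured at a fixed false alarm rate.

```python
import math


def ratio_db(numerator_power, denominator_power):
    """Power ratio expressed in decibels."""
    return 10.0 * math.log10(numerator_power / denominator_power)


def accuracy_at(ser_res_db, curve):
    """Piecewise-linear lookup of WW accuracy (0..1) against residual SER in dB.

    curve is a list of (ser_db, accuracy) points sorted by ser_db, assumed to
    have been measured at one false alarm rate of interest.
    """
    if ser_res_db <= curve[0][0]:
        return curve[0][1]
    if ser_res_db >= curve[-1][0]:
        return curve[-1][1]
    for (x0, y0), (x1, y1) in zip(curve, curve[1:]):
        if x0 <= ser_res_db <= x1:
            t = (ser_res_db - x0) / (x1 - x0)
            return y0 + t * (y1 - y0)


def net_performance(p_ww, p_rob, k):
    """Scalar metric P = (1 - K) * P_WW + K * P_Rob."""
    return (1.0 - k) * p_ww + k * p_rob


# Hypothetical accuracy curve for one WW detector.
CURVE = [(-20.0, 0.10), (-10.0, 0.50), (0.0, 0.90), (10.0, 0.98)]

ser_mic_db = ratio_db(1.0, 10.0)   # speech power / echo power: about -10 dB
erle_db = ratio_db(100.0, 1.0)     # mic power / residual power: about 20 dB
p_ww = accuracy_at(ser_mic_db + erle_db, CURVE)
p_rob = 1.0 - 0.25                 # from an assumed M_prob of 0.25
p = net_performance(p_ww, p_rob, k=0.5)
```

Setting K closer to 1 makes the robustness component dominate, while K closer to 0 makes the scalar metric track WW accuracy almost exclusively.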

成本和性能的权衡Cost and performance trade-off

When comparing the estimated cost of an echo reference with the estimated EMS performance improvement it provides, a method is needed for comparing two quantities that are generally not in the same domain. One such method involves evaluating the cost estimate and the performance estimate separately and adopting the lowest-cost solution that meets a predefined minimum performance criterion $P_{min}$. This predefined EMS performance criterion can be determined, for example, according to the requirements of a specific downstream application (e.g., providing a phone call, music playback, waiting for a WW, etc.).

For example, in an implementation where the application is WW detection, the performance may be related to the WW performance metric $P_{WW}$. In some such examples, there may be some minimum level of WW detector accuracy that is considered sufficient (e.g., 80%, 85%, 90%, 95% WW detector accuracy, etc.), which, according to the previous section, has a corresponding $ERLE^{dB}$. In some such examples, an EMS performance model (e.g., MC-EMS performance model 405 of FIG. 4) may be used to estimate the ERLE of the EMS. Therefore, if the goal is simply to find the lowest-cost solution (e.g., in terms of total cost $C_{total}$), such an implementation does not need to trade off cost against performance directly.
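The minimum-performance approach amounts to a filter followed by a minimum. The sketch below is one possible reading, with hypothetical candidate sets, costs, and performance estimates; in practice the performance values would come from an EMS performance model.

```python
from typing import NamedTuple, Optional, Sequence


class Candidate(NamedTuple):
    name: str            # label for a candidate set of echo references
    total_cost: float    # estimated total cost of using this set
    performance: float   # estimated EMS performance, e.g. P_WW in [0, 1]


def cheapest_meeting_minimum(
    candidates: Sequence[Candidate], p_min: float
) -> Optional[Candidate]:
    """Return the lowest-cost candidate whose performance is at least p_min."""
    feasible = [c for c in candidates if c.performance >= p_min]
    if not feasible:
        return None  # no candidate satisfies the minimum performance criterion
    return min(feasible, key=lambda c: c.total_cost)


# Hypothetical candidate sets and estimates.
CANDIDATES = [
    Candidate("local only", 1.0, 0.70),
    Candidate("local + one remote", 2.5, 0.91),
    Candidate("all references", 6.0, 0.97),
]

choice = cheapest_meeting_minimum(CANDIDATES, p_min=0.90)
```

Here the cheapest set that reaches 90% accuracy is "local + one remote"; cost and performance are never placed on a common scale.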

作为满足一些最低性能指标的替代方案，一些实施方式可能涉及使用性能指标P和成本指标C。一些这样的示例可能涉及使用权衡参数λ(例如，拉格朗日乘数)，并将成本/性能评估过程表述为寻求使某个量最大化的优化问题，比如在以下表达式中的变量F：As an alternative to meeting some minimum performance metric, some embodiments may involve using a performance metric P and a cost metric C. Some such examples may involve using a trade-off parameter λ (e.g., a Lagrange multiplier) and formulating the cost/performance evaluation process as an optimization problem that seeks to maximize some quantity, such as the variable F in the following expression:

$$F = P - \lambda C_{total}$$

It can be observed that, in the above equation, a relatively large value of F corresponds to a relatively large difference between the performance metric P and the product of λ and the total cost $C_{total}$. The trade-off parameter λ can be selected (e.g., by a system designer) to trade off cost against performance directly. An optimization algorithm can then be used to find the set of echo references to be used by the EMS, where the set of available echo references (which can include all available echo reference fidelity levels) determines the search space.
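For a small search space, maximizing F can be sketched as an exhaustive search over subsets of the available echo references. The optimization algorithm is not specified in the text, so brute force is used here purely for clarity. The per-reference costs and gains, and the diminishing-returns performance model, are hypothetical.

```python
from itertools import combinations
from math import prod

# Hypothetical (cost, standalone performance gain) for each echo reference.
REFERENCES = {
    "local": (1.0, 0.6),
    "kitchen": (2.0, 0.6),
    "bedroom": (2.0, 0.1),
}


def performance(subset):
    """Toy performance model P in [0, 1] with diminishing returns."""
    return 1.0 - prod(1.0 - REFERENCES[name][1] for name in subset)


def total_cost(subset):
    """C_total as the sum of the individual reference costs."""
    return sum(REFERENCES[name][0] for name in subset)


def best_reference_set(lam):
    """Return the subset of references that maximizes F = P - lam * C_total."""
    names = list(REFERENCES)
    candidates = [
        subset
        for size in range(len(names) + 1)
        for subset in combinations(names, size)
    ]
    return max(candidates, key=lambda s: performance(s) - lam * total_cost(s))


low_lambda_choice = best_reference_set(lam=0.1)   # cost matters little
high_lambda_choice = best_reference_set(lam=0.5)  # cost matters a lot
```

With the smaller λ the search keeps both the local and kitchen references; with the larger λ only the local reference survives, which shows how λ shifts the balance toward cost.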

图6是概述所公开方法的一个示例的流程图。与本文描述的其他方法一样，不必以所指示的顺序来执行方法600的框。此外，这样的方法可以包括比所示出和/或所描述的框更多或更少的框。在一些示例中，两个或更多个框可以同时执行。在该示例中，方法600是音频处理方法。Fig. 6 is a flow chart summarizing an example of the disclosed method. As with other methods described herein, the blocks of method 600 need not be performed in the order indicated. In addition, such a method may include more or fewer blocks than those shown and/or described. In some examples, two or more blocks may be performed simultaneously. In this example, method 600 is an audio processing method.

方法600可以由如图1A中示出且上文描述的装置50的装置或系统执行。在一些示例中，方法600的框可以由音频环境内的一个或多个设备来执行，例如，由音频系统控制器(如本文中被称为智能家居中枢的设备)或由音频系统的另一个部件来执行，如智能扬声器、电视、电视控制模块、膝上型计算机、移动设备(如蜂窝电话)等。在一些实施方式中，音频环境可以包括家庭环境的一个或多个房间。在其他示例中，音频环境可以是另一种类型的环境，如办公室环境、汽车环境、火车环境、街道或人行道环境、公园环境等。然而，在替代性实施方式中，方法600的至少一些框可以由实施基于云的服务的设备(如服务器)来执行。Method 600 may be performed by an apparatus or system such as apparatus 50 shown in FIG. 1A and described above. In some examples, the blocks of method 600 may be performed by one or more devices within an audio environment, for example, by an audio system controller (such as a device referred to herein as a smart home hub) or by another component of an audio system, such as a smart speaker, a television, a television control module, a laptop computer, a mobile device (such as a cellular phone), etc. In some embodiments, the audio environment may include one or more rooms of a home environment. In other examples, the audio environment may be another type of environment, such as an office environment, a car environment, a train environment, a street or sidewalk environment, a park environment, etc. However, in alternative embodiments, at least some blocks of method 600 may be performed by a device (such as a server) that implements a cloud-based service.

在该实施方式中，框605涉及由控制系统获得多个回声参考。在该示例中，多个回声参考包括针对音频环境中的多个音频设备中的每个音频设备的至少一个回声参考。这里，每个回声参考对应于由多个音频设备中的一个音频设备的一个或多个扩音器回放的音频数据。In this embodiment, block 605 involves obtaining, by the control system, a plurality of echo references. In this example, the plurality of echo references includes at least one echo reference for each of a plurality of audio devices in the audio environment. Here, each echo reference corresponds to audio data played back by one or more loudspeakers of an audio device in the plurality of audio devices.

在该示例中，框610涉及由控制系统对多个回声参考中的每个回声参考做出重要性估计。根据该示例，做出重要性估计涉及确定每个回声参考对由音频环境的至少一个音频设备的至少一个回声管理系统进行的回声减轻的预期贡献。在该示例中，至少一个回声管理系统包括声学回声消除器(AEC)和/或声学回声抑制器(AES)。In this example, block 610 involves making, by the control system, an importance estimate for each of the plurality of echo references. According to this example, making the importance estimate involves determining an expected contribution of each echo reference to echo mitigation by at least one echo management system of at least one audio device of the audio environment. In this example, the at least one echo management system includes an acoustic echo canceller (AEC) and/or an acoustic echo suppressor (AES).

在该实施方式中，框615涉及由控制系统并且至少部分地基于重要性估计来选择一个或多个所选回声参考。在该示例中，框620涉及由控制系统将一个或多个所选回声参考提供给至少一个回声管理系统。在一些实施方式中，方法600可以涉及使得至少一个回声管理系统至少部分地基于一个或多个所选回声参考来消除或抑制回声。In this embodiment, block 615 involves selecting, by the control system and based at least in part on the importance estimates, one or more selected echo references. In this example, block 620 involves providing, by the control system, the one or more selected echo references to at least one echo management system. In some embodiments, method 600 may involve causing at least one echo management system to cancel or suppress echoes based at least in part on the one or more selected echo references.
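Blocks 605 through 620 can be pictured as a short pipeline. The sketch below is a minimal illustration under stated assumptions: the class names, the use of mean-square level as the importance estimate, and the fixed limit on the number of selected references are all hypothetical choices, and the echo management system is a stand-in that only records what it receives.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class EchoReference:
    device_id: str        # audio device whose loudspeaker feed this mirrors
    samples: List[float]  # one block of loudspeaker feed samples


@dataclass
class EchoManagementSystem:
    """Stand-in for an AEC and/or AES; it only stores the references."""
    references: List[EchoReference] = field(default_factory=list)

    def set_references(self, refs: List[EchoReference]) -> None:
        self.references = list(refs)


def mean_square(ref: EchoReference) -> float:
    """Assumed importance estimate: average power of the reference block."""
    return sum(s * s for s in ref.samples) / len(ref.samples)


def run_method_600(
    references: List[EchoReference],
    ems: EchoManagementSystem,
    importance_fn: Callable[[EchoReference], float] = mean_square,
    max_selected: int = 2,
) -> List[EchoReference]:
    # Block 605: `references` holds at least one echo reference per device.
    # Block 610: make an importance estimate for every echo reference.
    scored = [(importance_fn(r), r) for r in references]
    # Block 615: select the references with the highest importance estimates.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    selected = [r for _, r in scored[:max_selected]]
    # Block 620: provide the selected references to the echo management system.
    ems.set_references(selected)
    return selected


refs = [
    EchoReference("tv", [0.5, -0.5, 0.5, -0.5]),
    EchoReference("kitchen", [0.1, -0.1, 0.1, -0.1]),
    EchoReference("local", [0.9, -0.9, 0.9, -0.9]),
]
ems = EchoManagementSystem()
selected_ids = [r.device_id for r in run_method_600(refs, ems)]
```

In this example the two loudest references ("local" and "tv") are passed to the echo management system and the quiet kitchen reference is dropped.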

In some examples, obtaining the plurality of echo references may involve receiving a content stream that includes audio data and determining one or more of the plurality of echo references based on the audio data. Some examples are described above with reference to the renderer 201A of FIG. 2A.

在一些实施方式中，控制系统可以包括音频环境中的多个音频设备中的音频设备的音频设备控制系统。在一些这样的示例中，所述方法可以涉及由音频设备控制系统渲染音频数据以用于在音频设备上再现，从而产生本地扬声器馈送信号。在一些这样的示例中，所述方法可以涉及确定与本地扬声器馈送信号相对应的本地回声参考。In some implementations, the control system may include an audio device control system for an audio device in a plurality of audio devices in the audio environment. In some such examples, the method may involve rendering, by the audio device control system, audio data for reproduction on the audio device, thereby generating a local speaker feed signal. In some such examples, the method may involve determining a local echo reference corresponding to the local speaker feed signal.

In some examples, obtaining the plurality of echo references may involve determining one or more non-local echo references based on the audio data. For example, each non-local echo reference may correspond to a non-local speaker feed signal for playback on another audio device of the audio environment.

根据一些示例，获得多个回声参考可以涉及接收一个或多个非本地回声参考。例如，每个非本地回声参考可以对应于用于在音频环境的另一音频设备上回放的非本地扬声器馈送信号。在一些示例中，接收一个或多个非本地回声参考可以涉及从音频环境的一个或多个其他音频设备接收一个或多个非本地回声参考。在一些示例中，接收一个或多个非本地回声参考可以涉及从音频环境的单个其他设备接收一个或多个非本地回声参考中的每一个。According to some examples, obtaining multiple echo references may involve receiving one or more non-local echo references. For example, each non-local echo reference may correspond to a non-local speaker feed signal for playback on another audio device of the audio environment. In some examples, receiving one or more non-local echo references may involve receiving one or more non-local echo references from one or more other audio devices of the audio environment. In some examples, receiving one or more non-local echo references may involve receiving each of the one or more non-local echo references from a single other device of the audio environment.

在一些示例中，所述方法可以涉及成本确定。根据一些这样的示例，成本确定可以涉及确定多个回声参考中的至少一个回声参考的成本。在一些这样的示例中，选择一个或多个所选回声参考可以至少部分地基于成本确定。根据一些这样的示例，成本确定可以至少部分地基于用于传输至少一个回声参考所需的网络带宽、用于编码至少一个回声参考的编码计算要求、用于解码至少一个回声参考的解码计算要求、用于由回声管理系统使用至少一个回声参考的回声管理系统计算要求、或其一个或多个组合。在一些示例中，成本确定可以至少部分地基于至少一个回声参考在时域或频域中的全保真度复制品、至少一个回声参考的下采样版本、至少一个回声参考的有损压缩、至少一个回声参考的分段功率信息、或其一个或多个组合。在一些示例中，成本确定可以至少部分地基于与相对不太重要的回声参考相比对相对更重要的回声参考进行更少压缩的方法。In some examples, the method may involve cost determination. According to some such examples, the cost determination may involve determining a cost of at least one echo reference of a plurality of echo references. In some such examples, selecting one or more selected echo references may be based at least in part on the cost determination. According to some such examples, the cost determination may be based at least in part on a network bandwidth required for transmitting at least one echo reference, an encoding computational requirement for encoding at least one echo reference, a decoding computational requirement for decoding at least one echo reference, an echo management system computational requirement for using at least one echo reference by an echo management system, or one or more combinations thereof. In some examples, the cost determination may be based at least in part on a full-fidelity replica of at least one echo reference in a time domain or a frequency domain, a downsampled version of at least one echo reference, a lossy compression of at least one echo reference, segmented power information of at least one echo reference, or one or more combinations thereof. In some examples, the cost determination may be based at least in part on a method of compressing relatively more important echo references less than relatively less important echo references.
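One way to picture the cost determination is as a weighted sum of the listed cost components, evaluated per fidelity level, with the fidelity level chosen from the importance of the reference. Everything numeric below is hypothetical, as are the names, the ordering of the levels, and the weights that convert bandwidth and compute load into a common cost unit.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class FidelityLevel:
    name: str
    bandwidth_kbps: float  # network bandwidth needed to transmit the reference
    encode_load: float     # encoding computational requirement
    decode_load: float     # decoding computational requirement
    ems_load: float        # EMS computational requirement to use the reference


# Assumed ordering: most compressed first, least compressed last.
LEVELS = (
    FidelityLevel("banded power", 8.0, 1.0, 0.0, 2.0),
    FidelityLevel("lossy compressed", 64.0, 6.0, 4.0, 10.0),
    FidelityLevel("downsampled", 256.0, 2.0, 2.0, 10.0),
    FidelityLevel("full fidelity", 768.0, 0.0, 0.0, 10.0),
)


def reference_cost(level, bandwidth_weight=1.0, compute_weight=10.0):
    """Weighted sum of the bandwidth and compute components of the cost."""
    compute = level.encode_load + level.decode_load + level.ems_load
    return bandwidth_weight * level.bandwidth_kbps + compute_weight * compute


def level_for_importance(importance, levels: Sequence[FidelityLevel] = LEVELS):
    """Compress more important references less (importance in [0, 1])."""
    if not 0.0 <= importance <= 1.0:
        raise ValueError("importance must lie in [0, 1]")
    index = min(int(importance * len(levels)), len(levels) - 1)
    return levels[index]


important_level = level_for_importance(0.9)
minor_level = level_for_importance(0.1)
costs = {level.name: reference_cost(level) for level in LEVELS}
```

A highly important reference is sent at full fidelity, while a minor one is reduced to banded power information at a small fraction of the cost.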

According to some examples, the method may involve determining a current echo management system performance level. In some such examples, selecting the one or more selected echo references may be based at least in part on the current echo management system performance level.

In some examples, making the importance estimate may involve determining an importance metric for the corresponding echo reference. In some examples, determining the importance metric may involve determining a level of the corresponding echo reference, determining a uniqueness of the corresponding echo reference, determining a temporal persistence of the corresponding echo reference, determining an audibility of the corresponding echo reference, or one or more combinations thereof. According to some examples, determining the importance metric may be based at least in part on metadata corresponding to an audio device layout, loudspeaker metadata, metadata corresponding to received audio data, an upmix matrix, a loudspeaker activation matrix, or one or more combinations thereof. In some examples, determining the importance metric may be based at least in part on a current listening objective, a current ambient noise estimate, an estimate of the current performance of at least one echo management system, or one or more combinations thereof.
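A simple importance metric can combine the factors named above multiplicatively. The sketch below is one assumed formulation, not the method of this disclosure: level is measured as RMS, uniqueness as one minus the largest zero-lag normalized correlation with the other references, and persistence and audibility are passed in as precomputed values in [0, 1].

```python
import math
from typing import Sequence


def rms(x: Sequence[float]) -> float:
    """Root-mean-square level of a block of samples."""
    return math.sqrt(sum(v * v for v in x) / len(x))


def correlation(a: Sequence[float], b: Sequence[float]) -> float:
    """Magnitude of the zero-lag normalized correlation, in [0, 1]."""
    num = abs(sum(p * q for p, q in zip(a, b)))
    den = math.sqrt(sum(p * p for p in a) * sum(q * q for q in b))
    return num / den if den > 0.0 else 0.0


def importance(ref, others, persistence=1.0, audibility=1.0):
    """Assumed metric: level x uniqueness x persistence x audibility."""
    most_similar = max((correlation(ref, o) for o in others), default=0.0)
    uniqueness = 1.0 - most_similar
    return rms(ref) * uniqueness * persistence * audibility


a = [1.0, -1.0, 1.0, -1.0]
b = [0.5, -0.5, 0.5, -0.5]   # scaled copy of a, so it adds nothing new
c = [1.0, 1.0, -1.0, -1.0]   # uncorrelated with a at zero lag

importance_b = importance(b, [a])
importance_c = importance(c, [a])
```

The scaled copy receives an importance of zero because it is fully predictable from a reference that is already available, whereas the uncorrelated reference keeps its full level-based importance.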

图7示出了音频环境的平面图的示例，所述音频环境在该示例中是生活空间。与本文提供的其他图一样，图7中示出的要素的类型和数量仅作为示例提供。其他实施方式可以包括更多、更少和/或不同类型和数量的要素。Fig. 7 shows an example of a floor plan of an audio environment, which in this example is a living space. As with other figures provided herein, the types and quantities of the elements shown in Fig. 7 are provided only as examples. Other embodiments may include more, fewer and/or different types and quantities of elements.

根据该示例，环境700包括在左上方处的客厅710、在下方中央处的厨房715、以及在右下方的卧室722。跨生活空间分布的方框和圆圈表示一组扩音器705a-705h，该组扩音器中的至少一些扩音器在一些实施方式中可以是智能扬声器，放置在对空间方便的位置，但不遵循任何标准规定的布局(任意地放置)。在一些示例中，电视730可以被配置为至少部分地实施一个或多个所公开的实施例。在该示例中，环境700包括分布在整个环境中的相机711a-711e。在一些实施方式中，环境700中的一个或多个智能音频设备还可以包括一个或多个相机。所述一个或多个智能音频设备可以是单一用途音频设备或虚拟助理。在一些这样的示例中，可选传感器系统130的一个或多个相机可以驻留在电视730中或所述电视上、移动电话中或智能扬声器(如扩音器705b、705d、705e或705h中的一个或多个)中。尽管相机711a-711e没有在本公开中呈现的音频环境的每个描绘中示出，但在一些实施方式中，每个音频环境仍然可以包括一个或多个相机。According to this example, the environment 700 includes a living room 710 at the upper left, a kitchen 715 at the lower center, and a bedroom 722 at the lower right. The boxes and circles distributed across the living space represent a group of loudspeakers 705a-705h, at least some of which may be smart speakers in some embodiments, placed in a convenient position for the space, but not following any standard prescribed layout (arbitrarily placed). In some examples, the television 730 may be configured to at least partially implement one or more disclosed embodiments. In this example, the environment 700 includes cameras 711a-711e distributed throughout the environment. In some embodiments, one or more smart audio devices in the environment 700 may also include one or more cameras. The one or more smart audio devices may be single-purpose audio devices or virtual assistants. In some such examples, one or more cameras of the optional sensor system 130 may reside in or on the television 730, in a mobile phone, or in a smart speaker (such as one or more of the loudspeakers 705b, 705d, 705e, or 705h). Although cameras 711a - 711e are not shown in each depiction of an audio environment presented in this disclosure, in some implementations, each audio environment may still include one or more cameras.

本公开的一些方面包括一种被配置(例如，被编程)成执行所公开方法的一个或多个示例的系统或设备，以及一种存储用于实施所公开方法或其步骤的一个或多个示例的代码的有形计算机可读介质(例如，磁盘)。例如，一些公开的系统可以是或者包括可编程通用处理器、数字信号处理器或微处理器，所述可编程通用处理器、数字信号处理器或微处理器用软件或固件编程为和/或以其他方式被配置成对数据执行各种操作中的任一个，包括所公开方法或其步骤的实施例。这样的通用处理器可以是或者包括计算机系统，所述计算机系统包括输入设备、存储器和处理子系统，所述处理子系统被编程(和/或以其他方式被配置)为响应于向其断言的数据而执行所公开方法(或其步骤)的一个或多个示例。Some aspects of the present disclosure include a system or device configured (e.g., programmed) to perform one or more examples of the disclosed methods, and a tangible computer-readable medium (e.g., a disk) storing code for implementing one or more examples of the disclosed methods or steps thereof. For example, some disclosed systems may be or include a programmable general-purpose processor, a digital signal processor, or a microprocessor that is programmed with software or firmware and/or otherwise configured to perform any of a variety of operations on data, including embodiments of the disclosed methods or steps thereof. Such a general-purpose processor may be or include a computer system that includes an input device, a memory, and a processing subsystem that is programmed (and/or otherwise configured) to perform one or more examples of the disclosed methods (or steps thereof) in response to data asserted thereto.

一些实施例可以被实施为可配置的(例如，可编程的)数字信号处理器(DSP)，所述数字信号处理器被配置(例如，被编程和以其他方式被配置)为对(多个)音频信号执行需要的处理，包括对所公开方法的一个或多个示例的执行。可替代地，所公开系统(或其元件)的实施例可以被实施为通用处理器(例如，个人计算机(PC)或其他计算机系统或微处理器，其可以包括输入设备和存储器)，所述通用处理器用软件或固件编程为和/或以其他方式被配置成执行各种操作中的任一个，包括所公开方法的一个或多个示例。可替代地，本发明系统的一些实施例的元件被实施为被配置(例如，被编程)成执行所公开方法的一个或多个示例的通用处理器或DSP，并且所述系统还包括其他元件(例如，一个或多个扩音器和/或一个或多个麦克风)。被配置成执行所公开方法的一个或多个示例的通用处理器可以耦接到输入设备(例如，鼠标和/或键盘)、存储器和显示设备。Some embodiments may be implemented as a configurable (e.g., programmable) digital signal processor (DSP), which is configured (e.g., programmed and otherwise configured) to perform the required processing on (multiple) audio signals, including the execution of one or more examples of the disclosed method. Alternatively, embodiments of the disclosed system (or its elements) may be implemented as a general-purpose processor (e.g., a personal computer (PC) or other computer system or microprocessor, which may include an input device and a memory), which is programmed with software or firmware and/or otherwise configured to perform any of various operations, including one or more examples of the disclosed method. Alternatively, the elements of some embodiments of the system of the present invention are implemented as a general-purpose processor or DSP configured (e.g., programmed) to perform one or more examples of the disclosed method, and the system also includes other elements (e.g., one or more loudspeakers and/or one or more microphones). The general-purpose processor configured to perform one or more examples of the disclosed method may be coupled to an input device (e.g., a mouse and/or keyboard), a memory, and a display device.

本公开的另一方面是一种计算机可读介质(例如，磁盘或其他有形存储介质)，所述计算机可读介质存储用于执行所公开方法或其步骤的一个或多个示例的代码(例如，可执行以执行所公开方法或其步骤的一个或多个示例的编码器)。Another aspect of the present disclosure is a computer-readable medium (e.g., a disk or other tangible storage medium) storing code for executing one or more examples of the disclosed method or its steps (e.g., an encoder executable to execute one or more examples of the disclosed method or its steps).

虽然在本文中已经描述了本公开的具体实施例和本公开的应用，但是对于本领域普通技术人员而言显而易见的是，在不脱离本文描述的并要求保护的本公开的范围的情况下，可以对本文描述的实施例和应用进行许多改变。应当理解，虽然已经示出和描述了本公开的某些形式，但是本公开不限于所描述和示出的具体实施例或所描述的具体方法。Although specific embodiments of the present disclosure and applications of the present disclosure have been described herein, it will be apparent to those skilled in the art that many changes may be made to the embodiments and applications described herein without departing from the scope of the present disclosure as described and claimed herein. It should be understood that although certain forms of the present disclosure have been shown and described, the present disclosure is not limited to the specific embodiments described and shown or the specific methods described.

可以从以下枚举的示例实施例(EEE)中理解本发明的各个方面：Various aspects of the present invention may be understood from the following enumerated example embodiments (EEE):

1.一种音频处理方法，包括：1. An audio processing method, comprising:

由控制系统获得多个回声参考，所述多个回声参考包括针对音频环境中的多个音频设备中的每个音频设备的至少一个回声参考，每个回声参考对应于由所述多个音频设备中的一个音频设备的一个或多个扩音器回放的音频数据；obtaining, by a control system, a plurality of echo references, the plurality of echo references comprising at least one echo reference for each of a plurality of audio devices in an audio environment, each echo reference corresponding to audio data played back by one or more loudspeakers of one of the plurality of audio devices;

由所述控制系统对所述多个回声参考中的每个回声参考做出重要性估计，其中，做出所述重要性估计涉及确定每个回声参考对由所述音频环境的至少一个音频设备的至少一个回声管理系统进行的回声减轻的预期贡献，所述至少一个回声管理系统包括声学回声消除器(AEC)、声学回声抑制器(AES)、或者AEC和AES两者；making, by the control system, an importance estimate for each of the plurality of echo references, wherein making the importance estimate involves determining an expected contribution of each echo reference to echo mitigation performed by at least one echo management system of at least one audio device of the audio environment, the at least one echo management system comprising an acoustic echo canceller (AEC), an acoustic echo suppressor (AES), or both an AEC and an AES;

由所述控制系统并且至少部分地基于所述重要性估计来选择一个或多个所选回声参考；以及selecting, by the control system and based at least in part on the importance estimate, one or more selected echo references; and

providing, by the control system, the one or more selected echo references to the at least one echo management system.

2.如EEE 1所述的音频处理方法，进一步包括使得至少一个回声管理系统至少部分地基于所述一个或多个所选回声参考来消除或抑制回声。2. The audio processing method of EEE 1, further comprising causing at least one echo management system to cancel or suppress echo based at least in part on the one or more selected echo references.

3.如EEE 1或EEE 2所述的音频处理方法，其中，获得所述多个回声参考涉及：3. The audio processing method according to EEE 1 or EEE 2, wherein obtaining the plurality of echo references involves:

接收包括音频数据的内容流；以及receiving a content stream including audio data; and

determining one or more echo references of the plurality of echo references based on the audio data.

4.如EEE 3所述的音频处理方法，其中，所述控制系统包括所述音频环境中的所述多个音频设备中的音频设备的音频设备控制系统，所述音频处理方法进一步包括：4. The audio processing method according to EEE 3, wherein the control system comprises an audio device control system of an audio device among the plurality of audio devices in the audio environment, and the audio processing method further comprises:

由所述音频设备控制系统渲染所述音频数据以供在所述音频设备上再现，以产生本地扬声器馈送信号；以及rendering, by the audio device control system, the audio data for reproduction on the audio device to produce local speaker feed signals; and

determining a local echo reference corresponding to the local speaker feed signal.

5.如EEE 4所述的音频处理方法，其中，获得所述多个回声参考涉及基于所述音频数据确定一个或多个非本地回声参考，所述非本地回声参考中的每一个对应于用于在所述音频环境的另一音频设备上回放的非本地扬声器馈送信号。5. The audio processing method of EEE 4, wherein obtaining the plurality of echo references involves determining one or more non-local echo references based on the audio data, each of the non-local echo references corresponding to a non-local speaker feed signal for playback on another audio device in the audio environment.

6.如EEE 4所述的音频处理方法，其中，获得所述多个回声参考涉及接收一个或多个非本地回声参考，所述非本地回声参考中的每一个对应于用于在所述音频环境的另一音频设备上回放的非本地扬声器馈送信号。6. The audio processing method of EEE 4, wherein obtaining the plurality of echo references involves receiving one or more non-local echo references, each of the non-local echo references corresponding to a non-local speaker feed signal for playback on another audio device in the audio environment.

7.如EEE 6所述的音频处理方法，其中，接收所述一个或多个非本地回声参考涉及从所述音频环境的一个或多个其他音频设备接收所述一个或多个非本地回声参考。7. The audio processing method of EEE 6, wherein receiving the one or more non-local echo references involves receiving the one or more non-local echo references from one or more other audio devices of the audio environment.

8.如EEE 6所述的音频处理方法，其中，接收所述一个或多个非本地回声参考涉及从所述音频环境的单个其他设备接收所述一个或多个非本地回声参考中的每一个。8. The audio processing method of EEE 6, wherein receiving the one or more non-local echo references involves receiving each of the one or more non-local echo references from a single other device of the audio environment.

9. The audio processing method of any one of EEEs 1-8, further comprising a cost determination, the cost determination involving determining a cost of at least one echo reference of the plurality of echo references, wherein selecting the one or more selected echo references is based at least in part on the cost determination.

10.如EEE 9所述的音频处理方法，其中，所述成本确定基于用于传输所述至少一个回声参考所需的网络带宽、用于编码所述至少一个回声参考的编码计算要求、用于解码所述至少一个回声参考的解码计算要求、用于由所述回声管理系统使用所述至少一个回声参考的回声管理系统计算要求、或其组合。10. The audio processing method of EEE 9, wherein the cost determination is based on network bandwidth required for transmitting the at least one echo reference, encoding computational requirements for encoding the at least one echo reference, decoding computational requirements for decoding the at least one echo reference, echo management system computational requirements for using the at least one echo reference by the echo management system, or a combination thereof.

11.如EEE 9或EEE 10所述的音频处理方法，其中，所述成本确定基于所述至少一个回声参考在时域或频域中的复制品、所述至少一个回声参考的下采样版本、所述至少一个回声参考的有损压缩、所述至少一个回声参考的分段功率信息、或其组合。11. The audio processing method as described in EEE 9 or EEE 10, wherein the cost determination is based on a replica of the at least one echo reference in the time domain or the frequency domain, a downsampled version of the at least one echo reference, a lossy compression of the at least one echo reference, segmented power information of the at least one echo reference, or a combination thereof.

12.如EEE 9-11中任一项所述的音频处理方法，其中，所述成本确定基于与相对不太重要的回声参考相比对相对更重要的回声参考进行更少压缩的方法。12. The audio processing method of any one of EEEs 9-11, wherein the cost determination is based on a method of compressing relatively more important echo references less than relatively less important echo references.

13.如EEE 1-12中任一项所述的音频处理方法，进一步包括确定当前回声管理系统性能水平，其中，选择所述一个或多个所选回声参考至少部分地基于所述当前回声管理系统性能水平。13. The audio processing method of any one of EEEs 1-12, further comprising determining a current echo management system performance level, wherein selecting the one or more selected echo references is based at least in part on the current echo management system performance level.

14. The audio processing method of any one of EEEs 1-13, wherein making the importance estimate involves determining an importance metric for the corresponding echo reference.

15. The audio processing method of EEE 14, wherein determining the importance metric involves determining a level of the corresponding echo reference, determining a uniqueness of the corresponding echo reference, determining a temporal persistence of the corresponding echo reference, determining an audibility of the corresponding echo reference, or a combination thereof.

16.如EEE 14或EEE 15所述的音频处理方法，其中，确定所述重要性度量至少部分地基于与音频设备布局相对应的元数据、扩音器元数据、与接收到的音频数据相对应的元数据、上混合矩阵、扩音器激活矩阵、或其组合。16. An audio processing method as described in EEE 14 or EEE 15, wherein determining the importance metric is at least partially based on metadata corresponding to an audio device layout, loudspeaker metadata, metadata corresponding to received audio data, an upmixing matrix, a loudspeaker activation matrix, or a combination thereof.

17.如EEE 14-16中任一项所述的音频处理方法，其中，确定所述重要性度量至少部分地基于当前收听目标、当前环境噪声估计、所述至少一个回声管理系统的当前性能的估计、或其组合。17. The audio processing method of any one of EEE 14-16, wherein determining the importance metric is based at least in part on current listening goals, current ambient noise estimates, estimates of current performance of the at least one echo management system, or a combination thereof.

18.一种装置，所述装置被配置为执行如EEE 1-17中任一项所述的方法。18. An apparatus configured to perform the method as described in any one of EEE 1-17.

19.一种系统，所述系统被配置成执行如EEE 1-17中任一项所述的方法。19. A system configured to perform the method as described in any one of EEE 1-17.

20.一个或多个其上存储有软件的非暂态介质，所述软件包括用于控制一个或多个设备执行如EEE 1-17中任一项所述的方法的指令。20. One or more non-transitory media having software stored thereon, the software comprising instructions for controlling one or more devices to perform the method of any one of EEE 1-17.

Claims

1. An audio processing method for managing echoes of a first audio device of a plurality of audio devices of an audio system, wherein each audio device of the plurality of audio devices includes one or more loudspeakers, Wherein, the first audio device further includes a control system, wherein the control system includes an echo management system, the echo management system includes an acoustic echo canceller (AEC), an acoustic echo suppressor (AES), or an AEC and an AES. For both, the methods include:

obtaining, by the control system of the first audio device, a plurality of echo references, the plurality of echo references including at least one echo reference for each audio device of the plurality of audio devices, each echo reference corresponding to audio data played back by the one or more loudspeakers of the corresponding audio device;

making, by the control system, an importance estimate for each echo reference of the plurality of echo references, wherein making the importance estimate involves determining an expected contribution of each echo reference to echo mitigation by the echo management system of the first audio device;

selecting, by the control system and based at least in part on the importance estimate, one or more selected echo references from the plurality of echo references;

providing the one or more selected echo references to the echo management system by the control system; and

suppressing or cancelling echo, by the echo management system of the first audio device, based at least in part on the one or more selected echo references.

2. The audio processing method of claim 1, wherein obtaining the plurality of echo references involves:

receiving a content stream including audio data; and

determining one or more echo references of the plurality of echo references based on the audio data.

3. The audio processing method as claimed in claim 2, further comprising:

rendering, by the control system, the audio data for reproduction on the first audio device, to produce a local loudspeaker feed signal; and

determining a local echo reference corresponding to the local loudspeaker feed signal.

4. The audio processing method of claim 3, wherein obtaining the plurality of echo references involves determining one or more non-local echo references based on the audio data, each of the non-local echo references corresponding to a non-local loudspeaker feed signal for playback on another audio device of the audio environment.

5. The audio processing method of claim 3, wherein obtaining the plurality of echo references involves receiving one or more non-local echo references, each of the non-local echo references corresponding to a non-local loudspeaker feed signal for playback on another audio device of the audio environment.

6. The audio processing method of claim 5, wherein receiving the one or more non-local echo references involves receiving the one or more non-local echo references from one or more other audio devices of the audio environment.

7. The audio processing method of claim 5, wherein receiving the one or more non-local echo references involves receiving each of the one or more non-local echo references from a single other device of the audio environment.

8. The audio processing method of any one of claims 1 to 7, further comprising making a cost determination that involves determining a cost for at least one echo reference of the plurality of echo references, wherein selecting the one or more selected echo references is based at least in part on the cost determination.

9. The audio processing method of claim 8, wherein the cost determination is based on network bandwidth required for transmitting the at least one echo reference, encoding computational requirements for encoding the at least one echo reference, decoding computational requirements for decoding the at least one echo reference, echo management system computational requirements for use of the at least one echo reference by the echo management system, or a combination thereof.

10. An audio processing method as claimed in claim 8 or claim 9, wherein the cost determination is based on a replica of the at least one echo reference in a time domain or a frequency domain, a downsampled version of the at least one echo reference, a lossy compression of the at least one echo reference, banded power information for the at least one echo reference, a method of compressing relatively more important echo references less than relatively less important echo references, or a combination thereof.

11. An audio processing method as claimed in any one of claims 8 to 10, wherein the cost determination is based on a method of compressing relatively more important echo references less than relatively less important echo references.

12. The audio processing method of any one of claims 1 to 11, further comprising determining a current echo management system performance level, wherein selecting the one or more selected echo references is based at least in part on the current echo management system performance level.

13. An audio processing method as claimed in any one of claims 1 to 12, wherein making the importance estimate involves determining an importance metric for a corresponding echo reference.

14. The audio processing method of claim 13, wherein determining the importance metric is based at least in part on a level of the corresponding echo reference, a uniqueness of the corresponding echo reference, a temporal persistence of the corresponding echo reference, an audibility of the corresponding echo reference, or a combination thereof.

15. The audio processing method of claim 13 or claim 14, wherein determining the importance metric is based at least in part on metadata corresponding to an audio device layout, loudspeaker metadata, metadata corresponding to received audio data, an upmixing matrix, a loudspeaker activation matrix, or a combination thereof.

16. The audio processing method of any one of claims 13 to 15, wherein determining the importance metric is based at least in part on a current listening objective, a current ambient noise estimate, an estimate of a current performance of the echo management system, or a combination thereof.

17. The audio processing method of any one of the preceding claims, wherein the audio devices of the audio system are communicatively coupled via a wired or wireless communications network, and wherein the plurality of echo references is obtained via the wired or wireless communications network.

18. An apparatus configured to perform the method of any one of claims 1 to 17.

19. A system configured to perform the method of any one of claims 1 to 17.

20. One or more non-transitory media having software stored thereon, the software comprising instructions for controlling one or more devices to perform the method of any one of claims 1 to 17.
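
By way of a non-limiting illustration, the selection step of claim 1, combined with the cost determination of claim 8, could be sketched as a greedy selection under a cost budget. The `EchoReference` class, the `select_echo_references` function, the single scalar cost, the `budget` parameter, and the importance-per-cost ranking rule are all hypothetical and chosen for this example only; the claims do not require any particular selection rule.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class EchoReference:
    """Hypothetical record pairing a reference with its two estimates."""
    name: str
    importance: float  # expected contribution to echo mitigation
    cost: float        # e.g. bandwidth plus computational cost, assumed units


def select_echo_references(
    candidates: Sequence[EchoReference], budget: float
) -> List[EchoReference]:
    """Greedily pick references by importance per unit cost within a budget."""

    def value_density(ref: EchoReference) -> float:
        # Zero-cost references (such as a local reference) rank first.
        return ref.importance / ref.cost if ref.cost > 0 else math.inf

    selected: List[EchoReference] = []
    remaining = budget
    for ref in sorted(candidates, key=value_density, reverse=True):
        if ref.importance <= 0:
            continue  # no expected contribution, never worth its cost
        if ref.cost <= remaining:
            selected.append(ref)
            remaining -= ref.cost
    return selected


candidates = [
    EchoReference("local", importance=0.9, cost=0.0),
    EchoReference("kitchen", importance=0.6, cost=2.0),
    EchoReference("bedroom", importance=0.2, cost=2.0),
    EchoReference("hallway", importance=0.5, cost=1.0),
]
chosen = select_echo_references(candidates, budget=3.0)
```

In this sketch the selected references would then be provided to the echo management system. A current echo management system performance level, as in claim 12, could be accommodated by adjusting `budget`, although that mapping is likewise an assumption.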