WO2025207648A1

WO2025207648A1 - Direction-of-arrival estimation based on analysis of signals from a microphone array

Info

Publication number: WO2025207648A1
Application number: PCT/US2025/021361
Authority: WO
Inventors: Ziqing Li; Kai Li
Original assignee: Dolby Laboratories Licensing Corp
Current assignee: Dolby Laboratories Licensing Corp
Priority date: 2024-03-27
Filing date: 2025-03-25
Publication date: 2025-10-02
Anticipated expiration: 2026-09-27

Abstract

A method involving receiving, by a control system, microphone signals obtained by a microphone array; enhancing, by the control system, directional components of the microphone signals, to produce directionally-enhanced microphone signals; calculating, by the control system, spatial covariance matrices based on the directionally-enhanced microphone signals; vectorizing, by the control system, the spatial covariance matrices to produce spatial covariance vectors; determining, by the control system, reference spatial covariance matrices of impulse responses for a range of directions; vectorizing, by the control system, the reference spatial covariance matrices to produce reference spatial covariance vectors; calculating, by the control system, similarities between the spatial covariance vectors and the reference spatial covariance vectors; and estimating, by the control system, one or more sound source directions of arrival based on the similarities.

Description

DIRECTION-OF-ARRIVAL ESTIMATION BASED ON ANALYSIS OF SIGNALS FROM A MICROPHONE ARRAY

CROSS-REFERENCE TO RELATED APPLICATIONS

[0001] This application claims the benefit of priority from PCT Application No. PCT/CN2024/084094, filed on 27 March 2024, US Provisional Application No. 63/641,818 filed on 2 May 2024, and European Application No. 24173799.8 filed on 2 May 2024, each of which is incorporated by reference herein in its entirety.

TECHNICAL FIELD

[0002] This disclosure pertains to devices, systems and methods for automatic direction-of-arrival (DOA) estimation based on signals from an array of microphones.

BACKGROUND

[0003] The ability to locate sound sources accurately is of great importance in various applications such as beamforming, acoustic scene analysis, spatial recording, etc. DOA estimation techniques are used to determine the direction from which a sound wave is arriving at a microphone array. Although existing systems and methods for DOA estimation provide benefits, improved systems and methods would be desirable.

SUMMARY

[0004] At least some aspects of the present disclosure may be implemented via methods. Some such methods may involve direction-of-arrival (DOA) estimation. For example, some methods may involve receiving, by a control system, microphone signals obtained by a microphone array. Some methods may involve enhancing, by the control system, directional components of the microphone signals, to produce directionally-enhanced microphone signals. Some methods may involve calculating, by the control system, spatial covariance matrices based on the directionally-enhanced microphone signals. Some methods may involve vectorizing, by the control system, the spatial covariance matrices to produce spatial covariance vectors. Some methods may involve determining, by the control system, reference spatial covariance matrices of impulse responses for a range of directions.
Some methods may involve vectorizing, by the control system, the reference spatial covariance matrices to produce reference spatial covariance vectors. Some methods may involve calculating, by the control system, similarities between the spatial covariance vectors and the reference spatial covariance vectors. Some methods may involve estimating one or more sound source directions of arrival based on the similarities.

[0005] According to some examples, the range of directions may be based on prior information. However, in some examples the range of directions may not be based on prior information. In some such examples, the range of directions may include 2π radians in at least one plane. According to some examples, enhancing the directional components of the microphone signals may involve suppressing diffuse components of the microphone signals.

[0006] In some examples, enhancing the directional components of the microphone signals may involve applying a time-frequency mask. In some such examples, the time-frequency mask may be based on an estimated direct-to-diffuse ratio. According to some examples, the estimated direct-to-diffuse ratio may be estimated based on diffuse power tracking. In some such examples, the diffuse power tracking may be based, at least in part, on eigenvalue distributions of the spatial covariance matrices of the microphone signals.

[0007] According to some examples, the similarities between the spatial covariance vectors and the reference spatial covariance vectors may be calculated for each frequency band of a plurality of frequency bands. In some such examples, the similarity in each frequency band may be calculated based on an inner product of a spatial covariance vector and the reference spatial covariance vectors. According to some examples, the similarity in each frequency band may be calculated based on distances between a spatial covariance vector and the reference spatial covariance vectors.
[0008] Some methods may involve estimating, by the control system, a confidence metric for each estimated sound source direction of arrival. In some such examples, the confidence metric may be based, at least in part, on estimated direction-of-arrival and direct-to-diffuse ratio.

[0009] According to some examples, determining the reference spatial covariance matrices of impulse responses may be based, at least in part, on a geometry of the microphone array. In some examples, determining the reference spatial covariance matrices of impulse responses may involve measuring an impulse response from a point source to each microphone of the microphone array. In some examples, determining the reference spatial covariance matrices of impulse responses may involve finite element modeling.

[0010] Some methods may involve determining one or more weighting factors for one or more components of the spatial covariance vectors and applying the one or more weighting factors to the one or more components. In some such examples, the one or more weighting factors may be based, at least in part, on inner products of, or distances between, a spatial covariance vector and the reference spatial covariance vectors. In some examples, determining the one or more weighting factors may be based on microphone anomaly detection, microphone occlusion, microphone array geometry, microphone directivity, or combinations thereof. According to some examples, the one or more weighting factors may be determined on a per-fast-Fourier-transform-bin (per-FFT-bin) basis. Some such methods may involve determining, by the control system, a weighted full-band similarity by accumulating a similarity for each frequency bin of a plurality of frequency bins. In some examples, an estimated sound source direction of arrival may be based on the weighted full-band similarity. According to some examples, determining the one or more weighting factors on the per-FFT-bin basis may involve applying one or more time-frequency masks.
In some such examples, the one or more time-frequency masks may be based on a direct-to-diffuse ratio, a signal-to-noise ratio, onset detection for extracting a direct portion of a sound source signal, an audio object extraction algorithm, wind noise detection, a voiced/unvoiced classification, or combinations thereof.

[0011] Some or all of the operations, functions and/or methods described herein may be performed by one or more devices according to instructions (e.g., software) stored on one or more non-transitory media. Such non-transitory media may include memory devices such as those described herein, including but not limited to random access memory (RAM) devices, read-only memory (ROM) devices, etc. Accordingly, some innovative aspects of the subject matter described in this disclosure can be implemented in a non-transitory medium having software stored thereon.

[0012] At least some aspects of the present disclosure may be implemented by an apparatus or by a system. For example, one or more devices may be capable of performing, at least in part, the methods disclosed herein. In some implementations, an apparatus may include an interface system and a control system. The control system may include one or more general purpose single- or multi-chip processors, digital signal processors (DSPs), application specific integrated circuits (ASICs), field programmable gate arrays (FPGAs) or other programmable logic devices, discrete gates or transistor logic, discrete hardware components, or combinations thereof. In some examples, the apparatus may be one of the above-referenced audio devices. However, in some implementations the apparatus may be another type of device, such as a mobile device, a laptop, a server, etc.

[0013] Details of one or more implementations of the subject matter described in this specification are set forth in the accompanying drawings and the description below.
Other features, aspects, and advantages will become apparent from the description, the drawings, and the claims. Note that the relative dimensions of the following figures may not be drawn to scale.

BRIEF DESCRIPTION OF THE DRAWINGS

[0014] Figure 1A is a block diagram that shows examples of components of an apparatus capable of implementing various aspects of this disclosure.

[0015] Figure 1B shows additional examples of blocks that may be configured for implementing various aspects of this disclosure.

[0016] Figure 2 shows additional examples of blocks that may be configured for implementing various aspects of this disclosure.

[0017] Figure 3 is a flow diagram that outlines another example of a method that may be performed by an apparatus or system such as those disclosed herein.

[0018] Figure 4 is a block diagram of an immersive voice and audio services (IVAS) coder/decoder (“codec”) framework for encoding and decoding IVAS bitstreams, according to one or more embodiments.

[0019] Like reference numbers and designations in the various drawings indicate like elements.

DETAILED DESCRIPTION

[0020] The ability to locate sound sources accurately is of great importance in various applications such as beamforming, acoustic scene analysis, spatial recording, etc. Direction-of-arrival (DOA) estimation techniques are used to determine the direction from which a sound wave is arriving at a microphone array.

[0021] An important set of statistical parameters for a microphone array is the spatial covariance matrix, which captures the statistical relationship between the signals received by each microphone in the array, and can be used to estimate the direction-of-arrival of a sound source based on the spatial correlation of the signals. Many state-of-the-art direction-of-arrival methods, such as beamforming-based methods, subspace-based methods, and maximum likelihood estimation, estimate the direction of a sound source by taking advantage of the spatial covariance matrix.

[0022] Our proposed direction-of-arrival estimation methods are based on analyzing the similarity of covariance matrices of the directional sound and the impulse responses of possible directions. Some disclosed methods involve extracting the directional part from the observed signals using spatial covariance analysis. The direction-of-arrival may be estimated from the similarity of the covariance of the directional sound and the covariances of impulse responses of possible directions.

[0023] Our disclosed methods differ from existing techniques in several ways. Firstly, in some disclosed methods the directional part is extracted as the input of the estimation module. Secondly, in some disclosed methods the spatial covariance matrix is vectorized so that we can measure the similarity of two spatial covariance matrices using the distance of the corresponding vectors. Thirdly, in some disclosed methods the direction-of-arrival is estimated from the similarity measurement between the observed signals and the impulse responses of possible directions.
Finally, some disclosed methods output a confidence value to show how much the estimated result can be trusted.

[0024] Figure 1A is a block diagram that shows examples of components of an apparatus capable of implementing various aspects of this disclosure. As with other figures provided herein, the types and numbers of elements shown in Figure 1A are merely provided by way of example. Other implementations may include more, fewer and/or different types and numbers of elements. According to some examples, the apparatus 101 may be, or may include, a device that is configured for performing at least some of the methods disclosed herein, such as a smart audio device, a laptop computer, a cellular telephone, a tablet device, a smart home hub, etc. In some such implementations the apparatus 101 may be, or may include, a server that is configured for performing at least some of the methods disclosed herein.

[0025] In this example, the apparatus 101 includes an interface system 105 and a control system 110. In some implementations, the control system 110 may be configured for performing, at least in part, the methods disclosed herein. The control system 110 may, in some implementations, be configured for receiving, via the interface system 105, microphone signals obtained by a microphone array. In some examples, the microphone array may be in or on another device, whereas in other examples, the microphone array may be part of the apparatus 101. In some examples, the microphone signals may correspond to user-generated content (UGC). According to some examples, the microphone array may be part of a wearable audio capture device, such as true wireless (TWS) earbuds, smart glasses, etc.

[0026] In some examples, the control system 110 may be configured for enhancing directional components of the microphone signals, to produce directionally-enhanced microphone signals.
According to some examples, the control system 110 may be configured for calculating spatial covariance matrices based on the directionally-enhanced microphone signals. In some examples, the control system 110 may be configured for vectorizing the spatial covariance matrices to produce spatial covariance vectors. According to some examples, the control system 110 may be configured for determining reference spatial covariance matrices of impulse responses for a range of directions. In some examples, the control system 110 may be configured for vectorizing the reference spatial covariance matrices to produce reference spatial covariance vectors. According to some examples, the control system 110 may be configured for calculating similarities between the spatial covariance vectors and the reference spatial covariance vectors and for estimating one or more sound source directions of arrival based on the similarities.

[0027] The interface system 105 may include one or more network interfaces and/or one or more external device interfaces (such as one or more universal serial bus (USB) interfaces). According to some implementations, the interface system 105 may include one or more wireless interfaces. The interface system 105 may include one or more devices for implementing a user interface, such as one or more microphones, one or more speakers, a display system, a touch sensor system and/or a gesture sensor system. In some examples, the interface system 105 may include one or more interfaces between the control system 110 and a memory system, such as the optional memory system 115 shown in Figure 1A. However, the control system 110 may include a memory system in some instances.
[0028] The control system 110 may, for example, include a general purpose single- or multi-chip processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, and/or discrete hardware components.

[0029] In some implementations, the control system 110 may reside in more than one device. For example, a portion of the control system 110 may reside in a device within an environment (such as a laptop computer, a tablet computer, a smart audio device, etc.) and another portion of the control system 110 may reside in a device that is outside the environment, such as a server. In other examples, a portion of the control system 110 may reside in a device within an environment and another portion of the control system 110 may reside in one or more other devices of the environment.

[0030] In some examples, the control system 110 may be configured for implementing at least part of a codec for Immersive Voice and Audio Services (IVAS). Some examples are described herein with reference to Figure 4.

[0031] Some or all of the methods described herein may be performed by one or more devices according to instructions (e.g., software) stored on one or more non-transitory media. Such non-transitory media may include memory devices such as those described herein, including but not limited to random access memory (RAM) devices, read-only memory (ROM) devices, etc. The one or more non-transitory media may, for example, reside in the optional memory system 115 shown in Figure 1A and/or in the control system 110. Accordingly, various innovative aspects of the subject matter described in this disclosure can be implemented in one or more non-transitory media having software stored thereon. The software may, for example, include instructions for controlling at least one device to process audio data.
The software may, for example, be executable by one or more components of a control system such as the control system 110 of Figure 1A.

[0032] In some examples, the apparatus 101 may include the optional microphone system 120 shown in Figure 1A. The optional microphone system 120 may include one or more microphones. In some implementations, one or more of the microphones may be part of, or associated with, another device, such as a loudspeaker, a smart audio device, a wearable audio capture device such as TWS earbuds, smart glasses, etc.

[0033] According to some implementations, the apparatus 101 may include the optional loudspeaker system 125 shown in Figure 1A. The optional loudspeaker system 125 may include one or more loudspeakers. Loudspeakers may sometimes be referred to herein as “speakers.” In some examples, at least some loudspeakers of the optional loudspeaker system 125 may be arbitrarily located. For example, at least some speakers of the optional loudspeaker system 125 may be placed in locations that do not correspond to any standard prescribed speaker layout, such as Dolby 5.1, Dolby 5.1.2, Dolby 7.1, Dolby 7.1.4, Dolby 9.1, Hamasaki 22.2, etc. In some such examples, at least some loudspeakers of the optional loudspeaker system 125 may be placed in locations that are convenient to the space (e.g., in locations where there is space to accommodate the loudspeakers), but not in any standard prescribed loudspeaker layout.

[0034] In some implementations, the apparatus 101 may include the optional sensor system 130 shown in Figure 1A. The optional sensor system 130 may include a touch sensor system, a gesture sensor system, one or more cameras, etc.

[0035] In some implementations, the apparatus 101 may include the optional display system 135 shown in Figure 1A. The optional display system 135 may include one or more displays, such as one or more light-emitting diode (LED) displays.
In some instances, the optional display system 135 may include one or more organic light-emitting diode (OLED) displays. In some examples wherein the apparatus 101 includes the display system 135, the sensor system 130 may include a touch sensor system and/or a gesture sensor system proximate one or more displays of the display system 135. According to some such implementations, the control system 110 may be configured for controlling the display system 135 to present a graphical user interface (GUI), such as a GUI related to implementing one of the methods disclosed herein.

[0036] Figure 1B shows additional examples of blocks that may be configured for implementing various aspects of this disclosure. As with other figures provided herein, the types and numbers of elements shown in Figure 1B are merely provided by way of example. Other implementations may include more, fewer and/or different types and numbers of elements. According to this example, the blocks include a directional sound enhancement module 145, a similarity calculation and DOA estimation module 155 and a robustness control module 165, all of which are implemented by an instance of the control system 110 of Figure 1A. The directional sound enhancement module 145, similarity calculation and DOA estimation module 155 and robustness control module 165 may, for example, be implemented by the control system 110 according to software or other instructions stored on one or more non-transitory and computer-readable media.

[0037] In this example, the directional sound enhancement module 145 is configured to receive inputs 140, which are, or include, microphone signals obtained by a microphone array.
According to this example, the directional sound enhancement module 145 is configured to enhance directional components of the microphone signals, to produce directionally-enhanced microphone signals 147 and to provide the directionally-enhanced microphone signals 147 to the similarity calculation and DOA estimation module 155. In some examples, the directional sound enhancement module 145 is configured to estimate the a priori direct-to-diffuse ratio and to suppress diffuse components of the microphone signals. In some such examples, the directional sound enhancement module 145 is configured to track the diffuse sound power using a metric of diffuseness level derived from microphone array spatial covariance analysis.

[0038] According to this example, the similarity calculation and DOA estimation module 155 is configured to estimate the DOA of the main sound source indicated by the microphone signals based, at least in part, on the directionally-enhanced microphone signals 147 and on impulse response data 150, and to output estimated DOA information 175. In some examples, the similarity calculation and DOA estimation module 155 may be configured to calculate spatial covariance matrices based on the directionally-enhanced microphone signals. According to some examples, the similarity calculation and DOA estimation module 155 may be configured to measure the distance between the spatial covariance matrix of the directional sound and the covariance matrices of possible sound directions in each frequency band of a plurality of frequency bands and estimate the direction of the main sound source. In some examples, the range of directions may be based on prior information. However, in other examples, the range of directions is not based on prior information. In some such examples, the range of directions may include 2π radians in at least one plane.
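By way of illustration only, the band-wise comparison and direction search performed by a module such as the similarity calculation and DOA estimation module 155 might be sketched as follows. This is a minimal sketch under stated assumptions, not the disclosed implementation: the function name `estimate_doa`, the array layouts, the use of a normalized inner product as the similarity and the unweighted sum across bands are all hypothetical choices made for the example.

```python
import numpy as np

def estimate_doa(obs_vecs, ref_vecs, directions):
    """Pick the candidate direction whose reference vectors best match the observation.

    obs_vecs:   (K, D) real-valued covariance vectors of the observed signal, one per band.
    ref_vecs:   (N, K, D) real-valued reference covariance vectors for N candidate directions.
    directions: (N,) candidate directions (e.g., azimuths in radians).
    All shapes and names are assumptions made for this sketch.
    """
    obs = np.asarray(obs_vecs, dtype=float)
    ref = np.asarray(ref_vecs, dtype=float)
    eps = 1e-12  # guards against division by zero for all-zero vectors

    # Normalize each vector so that the inner product is a cosine similarity.
    obs_unit = obs / (np.linalg.norm(obs, axis=-1, keepdims=True) + eps)
    ref_unit = ref / (np.linalg.norm(ref, axis=-1, keepdims=True) + eps)

    # Similarity for every (direction, band) pair.
    per_band = np.einsum("kd,nkd->nk", obs_unit, ref_unit)

    # Assumed accumulation: plain sum over bands (no weighting factors).
    full_band = per_band.sum(axis=1)

    best = int(np.argmax(full_band))
    return directions[best], full_band
```

A distance-based similarity could be substituted for the inner product, and per-band weighting factors could be applied before the accumulation; neither variation is shown here.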
[0039] In some examples, the similarity calculation and DOA estimation module 155 may be configured to vectorize the spatial covariance matrices, to produce spatial covariance vectors. According to some examples, the similarity calculation and DOA estimation module 155 may be configured to determine reference spatial covariance matrices of impulse responses for a range of directions and to vectorize the reference spatial covariance matrices to produce reference spatial covariance vectors. In some examples, the similarity calculation and DOA estimation module 155 may be configured to calculate similarities between the spatial covariance vectors and the reference spatial covariance vectors, and to estimate one or more sound source directions of arrival based on the similarities.

[0040] According to this example, the robustness control module 165 is configured to estimate and output confidence metrics 170 based at least in part on the control signals 160. Some examples of the control signals 160 are described herein with reference to Figure 2. In some examples, the robustness control module 165 may be configured to estimate and output confidence metrics 170 corresponding to each sound source DOA that is estimated by the similarity calculation and DOA estimation module 155. The confidence metrics 170 may, for example, be based, at least in part, on a direct-to-diffuse ratio. According to some examples, the robustness control module 165 may be configured to steer the similarity calculation and direction-of-arrival estimation and to estimate the confidence value for the estimated directions of arrival. Some examples of steering the similarity calculation and the DOA estimation using weighted covariance matrices are described below.

[0041] Figure 2 shows additional examples of blocks that may be configured for implementing various aspects of this disclosure.
As with other figures provided herein, the types and numbers of elements shown in Figure 2 are merely provided by way of example. Other implementations may include more, fewer and/or different types and numbers of elements. According to this example, the directional sound enhancement module 145, similarity calculation and DOA estimation module 155 and robustness control module 165 are implemented by an instance of the control system 110 of Figure 1A and are instances of the corresponding modules that are described with reference to Figure 1B.

[0042] In the example shown in Figure 2, an instance of the control system 110 of Figure 1A is also configured to implement the short-time Fourier transform (STFT) block 205, the fast Fourier transform (FFT) block 210 and the covariance analysis blocks 215a and 215b. According to this example, the STFT block 205 is configured to output a sequence of Fourier transforms 207 to the directional sound enhancement module 145. Here, the sequence of Fourier transforms 207 corresponds to time-windowed microphone signals (the inputs 140) obtained by a microphone array.

[0043] According to this example, the directional sound enhancement module 145 is configured to enhance directional components of representations of the microphone signals (the sequence of Fourier transforms 207, in this example) to produce directionally-enhanced microphone signals 147 and to provide the directionally-enhanced microphone signals 147 to the covariance analysis block 215a. In this example, the covariance analysis block 215a is configured to calculate spatial covariance matrices 217a based on the directionally-enhanced microphone signals 147.

[0044] In this example, the FFT block 210 is configured to perform FFTs on the impulse response data 150, which is received by the FFT block 210 in the time domain, and to output frequency-domain impulse response data 212 to the covariance analysis block 215b.
In this example, the covariance analysis block 215b is configured to compute reference spatial covariance matrices 217b based on the frequency-domain impulse response data 212.

[0045] According to this example, the control signals 160 include anomaly detection data, time/frequency (T-F) masks from one or more other modules, direct-to-diffuse ratio data, or combinations thereof. In this example, the directional sound enhancement module 145 is configured to generate the direct-to-diffuse ratio data and to provide the direct-to-diffuse ratio data to the robustness control module 165.

[0046] Following are details of processes that may be performed by the blocks shown in Figures 1A, 1B and 2.

Calculating Reference Covariance Matrices for Multiple Directions

[0047] In some examples, the process of calculating reference covariance matrices involves selecting a sufficiently large number of locations from which impulse responses will be measured or estimated. The distances between those locations and the center of the microphone array depend on the specific application. If near-field direction-of-arrival estimation is preferred or emphasized for a particular use case, those distances should be small. If a particular use case or application primarily involves usage in the far field, in some examples the distances between the locations and the center of the microphone array will be large enough to satisfy the free-field condition. The density of the possible locations may be chosen according to both computational complexity and estimation performance. Prior knowledge (for example, prior knowledge regarding likely sound source locations and/or prior knowledge of microphone array characteristics) can be helpful for choosing the possible locations in a sub-space with suitable density.
For example, if a cell phone will be used to record sound for a particular use case, the selected locations may be limited to a plane in which microphones of the cell phone are located. Similarly, in some examples the selected locations may be primarily in front of a phone, e.g., in a direction that the phone camera is facing. If no such prior knowledge is available, some disclosed examples involve choosing evenly-spaced locations in all directions.

[0048] According to some examples, the impulse responses from the chosen locations to the microphones are obtained from acoustic measurement. In other examples, the impulse responses from the chosen locations to the microphones may be generated by one or more simulation methods. In some such examples, the simulation may be based on a complicated model, such as finite element modeling, which takes the influence of devices on the impulse response into consideration. In other examples, the simulation may be based on a simple model, such as free-field modeling, which generates impulse responses in free-field using only the microphone geometry.

[0049] According to some examples, a time window, such as a rectangular window, is applied to measured impulse responses, in order to extract the directional part from the measured impulse responses. In some such examples, the FFT block 210 of Figure 2 may be configured to apply the time window. According to some such examples, the rectangular window length is not larger than the window length applied by the STFT block 205 to microphone signals (the inputs 140) obtained by a microphone array.

[0050] Here, we take an impulse response of direction θ as an example to calculate the reference spatial covariance matrix. In this example, the FFT points used by the FFT block 210 are equal to those of the STFT applied by the STFT block 205 in order to keep the same frequency resolution as that of the observed signals.
In this example, the impulse response IR may be determined as follows:

$$\mathrm{IR}_m(\theta, k) = \mathrm{FFT}\{\mathrm{ir}_m(\theta, n)\}, \quad m = 1 \ldots M,$$

where $M$ represents the number of the microphones, $n$ represents the sample index in the time domain (seconds) and $k$ represents the bin index in the frequency domain (Hertz). We can rewrite the above formula in vector form, as follows:

$$\mathbf{IR}(\theta, k) = \mathrm{FFT}\{\mathbf{ir}(\theta, n)\}.$$

[0051] The reference spatial covariance matrix $\mathbf{C}_r$ may be calculated by:

$$\mathbf{C}_r(\theta, k) = \mathbf{IR}(\theta, k)\,\mathbf{IR}^H(\theta, k).$$

In the foregoing expression, and elsewhere herein, the superscript $H$ indicates the conjugate transpose of a matrix. Obviously, the spatial covariance matrix $\mathbf{C}_r$ is a Hermitian matrix.

[0052] We can use the trace of the matrix to normalize the reference spatial covariance matrix, as follows:

$$\bar{\mathbf{C}}_r(\theta, k) = \mathbf{C}_r(\theta, k) \,/\, \mathrm{tr}\big(\mathbf{C}_r(\theta, k)\big).$$

[0053] According to some disclosed methods, one option to measure the similarity of two matrices is the inner product of the two vectors that are obtained by vectorizing the two matrices. We disclose a method to vectorize a matrix here. Taking an array with three microphones as an example, if an element of the normalized matrix is denoted as $c_{i,j}$, the reference spatial covariance matrix can be vectorized as:

$$\mathbf{v}_r(\theta, k) = \mathrm{vec}\big(\bar{\mathbf{C}}_r(\theta, k)\big) = \big[\mathbf{v}_a^T,\ \mathbf{v}_p^T\big]^T,$$

where $\mathbf{v}_a = [c_{1,1},\ c_{2,2},\ c_{3,3}]^T$ contains the diagonal elements and $\mathbf{v}_p$ contains the real and imaginary parts of the off-diagonal elements $c_{1,2}$, $c_{1,3}$ and $c_{2,3}$.

[0054] In the foregoing equations, $\mathbf{v}_r(\theta, k)$ represents the vectorized reference spatial covariance matrix. The first $M$ elements in $\mathbf{v}_r(\theta, k)$, i.e., $\mathbf{v}_a$, can be viewed as a sub-vector related to amplitude differences. The last $M(M-1)$ elements in $\mathbf{v}_r(\theta, k)$, i.e., $\mathbf{v}_p$, form a sub-vector wherein the phase differences are included.

Directional Sound Enhancement

[0055] Diffuse sound is typically caused by sound waves bouncing off surfaces in the environment and arriving at the microphones from multiple directions. This diffuse sound can make it difficult for direction-of-arrival algorithms to accurately estimate the direction from which a particular sound is coming.
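Returning briefly to the reference covariance calculation of paragraphs [0050] to [0054], the chain of steps (FFT of the impulse responses, outer product per frequency bin, trace normalization, vectorization) might be sketched as follows. This is an illustrative sketch only. The function names, the array layouts, the use of a one-sided real FFT and the element ordering inside the vector (diagonal first, then all real parts, then all imaginary parts of the upper-triangular off-diagonal elements) are assumptions made for the example and are not taken from the disclosure.

```python
import numpy as np

def vectorize_covariance(C):
    """Map Hermitian matrices (..., M, M) to real vectors (..., M*M).

    Assumed layout: M diagonal elements, then the real parts and then the
    imaginary parts of the M(M-1)/2 upper-triangular off-diagonal elements.
    """
    M = C.shape[-1]
    diag = np.real(np.diagonal(C, axis1=-2, axis2=-1))  # amplitude-related part
    rows, cols = np.triu_indices(M, k=1)
    off = C[..., rows, cols]                             # phase-related part
    return np.concatenate([diag, off.real, off.imag], axis=-1)

def reference_covariance_vectors(ir, n_fft):
    """Reference covariance vectors for one candidate direction.

    ir:    (M, L) real time-domain impulse responses, one row per microphone.
    n_fft: FFT size, assumed equal to the FFT size of the STFT applied to
           the observed signals.
    Returns an array of shape (n_fft // 2 + 1, M * M).
    """
    spec = np.fft.rfft(ir, n=n_fft, axis=1).T            # (K, M), one row per bin
    # Outer product per bin: C[k, i, j] = spec[k, i] * conj(spec[k, j]).
    C = spec[:, :, None] * np.conj(spec[:, None, :])
    tr = np.real(np.trace(C, axis1=1, axis2=2))
    C_bar = C / np.maximum(tr, 1e-12)[:, None, None]     # trace normalization
    return vectorize_covariance(C_bar)
```

For two microphones whose impulse responses are unit impulses one sample apart, the vector at the zero-frequency bin is `[0.5, 0.5, 0.5, 0.0]`: equal power at both microphones and no phase difference.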
[0056] By enhancing the directional sound, which means reducing the diffuse part, the direction-of-arrival methods can more easily focus on the direct sound that is arriving at the microphones from the intended direction. Enhancing the directional sound usually leads to a more accurate direction estimation of the incoming sound.

[0057] In this disclosure, we use $x_m(n)$ to represent the observed signal from the m-th microphone. In some examples, the observed signals from $M$ microphones are transformed to the frequency domain using STFT, for example by the STFT block 205 of Figure 2. The FFT points of the STFT should be equal to the FFT points applied by the STFT block 205 in calculating the reference covariance matrices. The output of the STFT block 205 may be represented as follows:

$$X_m(k, \ell) = \mathrm{STFT}(x_m(n)), \quad m = 1 \ldots M,$$

where $k$ represents the bin index and $\ell$ represents the frame index.

[0058] Some examples may involve rewriting the above formula in vector form as follows:

$$\mathbf{X}(k, \ell) = \mathrm{STFT}(\mathbf{x}(n)).$$

[0059] In some such examples, the spatial covariance matrix is calculated by:

$$\mathbf{C}_x(k, \ell) = \mathbf{X}(k, \ell)\,\mathbf{X}^{H}(k, \ell),$$

where the spatial covariance matrix $\mathbf{C}_x(k, \ell)$ is a Hermitian matrix.

[0060] In some examples, the trace of the matrix may be used to normalize the covariance matrix, as follows:

$$\bar{\mathbf{C}}_x(k, \ell) = \mathbf{C}_x(k, \ell) / \mathrm{trace}(\mathbf{C}_x(k, \ell))$$

[0061] Some examples involve defining a quantity to measure the diffuseness level, which is denoted by the symbol $\Psi(k, \ell)$. In some such examples, the diffuseness level may vary from 0 to 1, where $\Psi(k, \ell) = 0$ implies that the observed signal is totally diffuse, and $\Psi(k, \ell) = 1$ means that the signal is pure directional sound, in other words, that the covariance matrix is of rank one.

[0062] In some examples, $\Psi(k, \ell)$ is determined from a quantity $\beta$, where $\beta$ is equal to the sum of the squared eigenvalues of the observed signals’ correlation matrix, namely:

$$\beta = \sum_{i=1}^{M} \lambda_i^{2}$$

[0063] In the foregoing expression, $\lambda_i$ represents the i-th eigenvalue of the matrix. Some disclosed examples involve using the eigen-decomposition method to obtain the eigenvalues. However, such methods can be complex and computationally intensive.
In some alternative examples, because the sum of the squared eigenvalues is equal to the squared Frobenius norm of the normalized covariance matrix, $\beta$ may be calculated simply by the following expression:

$$\beta = \|\bar{\mathbf{C}}_x(k, \ell)\|_F^{2}$$

[0064] Some disclosed examples involve using a smoothed version of $\Psi(k, \ell)$ to estimate the diffuse part for directional sound enhancement, for example as follows:

[0065] According to some examples, the smoothing factor may be a value in the range from 0.8 to 0.95.

[0066] In practice, it is almost impossible for $\Psi(k, \ell)$ to be equal to 1 because some amount of diffuse sound almost always exists. So, some examples involve modifying the value of $\Psi(k, \ell)$ nonlinearly to tune the method. One way to modify the value of $\Psi(k, \ell)$ is as follows:

[0067] In the foregoing expression, $\Psi_{\mathrm{floor}}$ represents a lower or “floor” threshold of the diffuseness level and $\Psi_{\mathrm{ceil}}$ represents an upper or “ceiling” threshold of the diffuseness level. In some examples, $\Psi_{\mathrm{floor}}$ may be in the range from 0.1 to 0.3, for example 0.2, and $\Psi_{\mathrm{ceil}}$ may be in the range from 0.7 to 0.95, for example 0.8, 0.85, 0.9, etc. The selection of appropriate floor and ceiling thresholds may be dependent on the specific application, characteristics of microphones used for the specific application, etc. For example, the floor threshold may be based on the self-noise of a particular type of microphone. In some examples, the floor and ceiling threshold values may vary by frequency.

[0068] Some disclosed methods may use exponential smoothing to estimate the diffuse sound statistics recursively, for example as follows:

[0069] In the foregoing expression, the “diff” subscripts are used to indicate diffuse components and $\tilde{\alpha}_{\mathrm{diff}}(k, \ell)$ represents a smoothing factor.
In some such examples, the smoothing factor $\tilde{\alpha}_{\mathrm{diff}}(k, \ell)$ may be obtained as follows:

$$\tilde{\alpha}_{\mathrm{diff}}(k, \ell) = \alpha_{\mathrm{diff}} + (1 - \alpha_{\mathrm{diff}})\,\Psi''(k, \ell),$$

where $\Psi''(k, \ell)$ denotes the nonlinearly modified diffuseness level.

[0070] Some disclosed methods may involve using the estimated diffuse sound statistics to obtain the a posteriori direct-to-diffuse ratio (DDR) after observing the signal, for example as follows:

[0071] Alternatively, or additionally, the a priori DDR may be calculated as follows:

In the foregoing expression, $\alpha_{\mathrm{ddr}}$ represents a smoothing factor. In some examples, the smoothing factor may be in the range from 0.9 to 0.98. In the foregoing expression, $g_{\mathrm{dir}}$ represents the gain to suppress the diffuse sound, which may be obtained as follows:

[0072] In some examples, the directional part of the observed signals may be obtained as follows:

In other words, the directional part of the observed signals can be obtained by applying $g_{\mathrm{dir}}$ to suppress the diffuse sound component.

Calculating the Spatial Covariance Matrix of the Directional Sound in the Current Frame

[0073] Some disclosed examples involve using the directionally-enhanced microphone signals $\mathbf{X}_{\mathrm{dir}}(k, \ell)$, produced by reducing the diffuse part, to calculate the spatial covariance matrix in the current frame. In some such examples, exponential smoothing may be applied to the directionally-enhanced signals, for example as follows:

In the foregoing expression, $\hat{\mathbf{C}}_{\mathrm{dir}}(k, \ell)$ represents the spatial covariance matrix and $\alpha_{\mathrm{dir}}$ represents a frequency-dependent constant, which may in some examples be in the range from 0.8 to 0.95. $\alpha_{\mathrm{dir}}$ is generally smaller for high frequencies.

[0074] Some disclosed methods involve normalizing the spatial covariance matrix by its trace and vectorizing the spatial covariance matrix, for example as described above regarding the reference covariance matrices.

In the expression above, $\mathbf{v}_{\mathrm{dir}}(k, \ell)$ represents the vectorized spatial covariance matrix of the directional sound, which is a column vector in this example.
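The normalize-and-vectorize step described above can be illustrated with a minimal sketch. It assumes a single-snapshot covariance for one time-frequency bin, and it assumes that the real and imaginary parts of each upper-triangular element are interleaved; that ordering, the function names and the sample values are illustrative assumptions rather than details taken from this disclosure.

```python
def outer_covariance(x):
    """Single-snapshot spatial covariance C = x x^H for one time-frequency bin."""
    return [[a * b.conjugate() for b in x] for a in x]


def trace_normalize(c):
    """Divide a covariance matrix by its trace so the diagonal sums to one."""
    trace = sum(c[i][i].real for i in range(len(c)))
    if trace <= 0.0:
        raise ValueError("covariance trace must be positive")
    return [[value / trace for value in row] for row in c]


def vectorize(c):
    """Flatten a Hermitian M x M matrix into a real vector of length M + M(M - 1).

    ASSUMPTION: the M diagonal terms come first, followed by interleaved
    (real, imaginary) parts of the upper-triangular elements.
    """
    m = len(c)
    vector = [c[i][i].real for i in range(m)]
    for i in range(m):
        for j in range(i + 1, m):
            vector.extend([c[i][j].real, c[i][j].imag])
    return vector


# Hypothetical three-microphone snapshot (unit amplitude, different phases).
snapshot = [1 + 0j, 1j, -1 + 0j]
v = vectorize(trace_normalize(outer_covariance(snapshot)))
```

For three microphones the result has nine real elements: three amplitude-related diagonal terms followed by six phase-related terms. Keeping only the upper triangle is sufficient because the matrix is Hermitian, so the lower triangle carries no additional information.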

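As a small numerical check of the eigenvalue shortcut used in the directional sound enhancement section, the sketch below computes the sum of squared eigenvalues of a trace-normalized Hermitian matrix as its squared Frobenius norm, without any eigen-decomposition. The two test matrices are hypothetical: a rank-one matrix (eigenvalues 1, 0, 0) standing in for purely directional sound, and an identity matrix divided by M (all eigenvalues 1/M) standing in for an idealized, spatially uncorrelated diffuse field.

```python
def squared_frobenius_norm(matrix):
    """Sum of |entry|^2; for a Hermitian matrix this equals the sum of squared eigenvalues."""
    return sum(abs(value) ** 2 for row in matrix for value in row)


def trace_normalized_covariance(x):
    """Rank-one covariance x x^H divided by its trace (the total power of x)."""
    power = sum(abs(a) ** 2 for a in x)
    return [[a * b.conjugate() / power for b in x] for a in x]


M = 3

# Rank-one case: eigenvalues are (1, 0, 0), so the sum of squares is 1.
directional = trace_normalized_covariance([1 + 0j, 1j, -1 + 0j])

# Idealized diffuse case (assumption): identity / M, eigenvalues all 1/M,
# so the sum of squares is 1/M.
diffuse = [[(1.0 / M if i == j else 0.0) for j in range(M)] for i in range(M)]

beta_directional = squared_frobenius_norm(directional)
beta_diffuse = squared_frobenius_norm(diffuse)
```

Under these assumptions the quantity ranges from 1/M to 1, which is consistent with a diffuseness measure that is later rescaled to the range from 0 to 1.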
Similarity Calculation Between the Spatial Covariance Matrix of the Directional Sound and the Spatial Covariance Matrix of the Impulse Responses of Multiple Directions

[0075] The similarity of two matrices can be calculated based on the inner product of the two vectors obtained by vectorizing the two matrices. For example, the similarity of the reference spatial covariance matrix and the spatial covariance matrix may be calculated as follows:

$$S(k, \ell, \theta) = \mathbf{v}_{\mathrm{dir}}^{T}(k, \ell)\,\mathbf{v}_r(\theta, k)$$

In the foregoing expression, $S(k, \ell, \theta)$ represents the similarity of the spatial covariance matrix of the directional part of the observed signals and the reference covariance matrices in the current frame, k-th bin, and direction of $\theta$.

[0076] The similarity of the vectorized reference spatial covariance matrix and the vectorized spatial covariance matrix can also be implemented as a distance, for example as a Euclidean distance, for example as follows:

In the foregoing expression, const represents a small constant value, which is included in order to avoid division by zero. In some examples, const may be very small, for example 0.0000001 or a similarly small value.

[0077] The two foregoing methods are advantageous because each of them allows for a simple calculation of the similarity of the reference spatial covariance matrix and the spatial covariance matrix. However, in addition to these two methods, many other methods can be applied to measure the similarity of the reference spatial covariance matrix and the spatial covariance matrix. For example, some alternative methods may involve calculating the Frobenius norm of the full matrices.
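The two similarity measures can be sketched as follows for a single frequency bin. This is an illustration under stated assumptions, not the disclosed implementation: the distance-based similarity is taken to be the reciprocal of the Euclidean distance plus const, the reference vectors come from a hypothetical free-field steering model for a three-microphone linear array, and the array positions, the 2 kHz bin frequency and all function names are invented for the example.

```python
import cmath
import math


def covariance_vector(x):
    """Trace-normalized x x^H, flattened to a real vector (diagonal, then Re/Im pairs)."""
    m = len(x)
    power = sum(abs(a) ** 2 for a in x)
    c = [[a * b.conjugate() / power for b in x] for a in x]
    vector = [c[i][i].real for i in range(m)]
    for i in range(m):
        for j in range(i + 1, m):
            vector.extend([c[i][j].real, c[i][j].imag])
    return vector


def inner_product_similarity(v_obs, v_ref):
    """Similarity as the inner product of two real covariance vectors."""
    return sum(a * b for a, b in zip(v_obs, v_ref))


def distance_similarity(v_obs, v_ref, const=1e-7):
    """ASSUMPTION: similarity = 1 / (Euclidean distance + const)."""
    distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(v_obs, v_ref)))
    return 1.0 / (distance + const)


def steering_vector(angle_deg, positions_m, freq_hz, speed_of_sound=343.0):
    """Hypothetical free-field response of a linear array to a far-field source."""
    delay_per_metre = math.cos(math.radians(angle_deg)) / speed_of_sound
    return [cmath.exp(-2j * math.pi * freq_hz * p * delay_per_metre) for p in positions_m]


positions = [0.0, 0.03, 0.09]          # microphone positions in metres (illustrative)
angles = [0, 45, 90, 135, 180]         # candidate directions in degrees
references = {a: covariance_vector(steering_vector(a, positions, 2000.0)) for a in angles}

# Observation simulated from the 45-degree direction, without noise.
observed = covariance_vector(steering_vector(45, positions, 2000.0))

best_by_inner_product = max(angles, key=lambda a: inner_product_similarity(observed, references[a]))
best_by_distance = max(angles, key=lambda a: distance_similarity(observed, references[a]))
```

In this noise-free setting both measures select the 45-degree reference. Because every vector is derived from a trace-normalized covariance of unit-modulus responses, all reference vectors have the same norm, so ranking by inner product and ranking by distance agree.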
Generating Control Signals to Steer the Direction-Of-Arrival Estimation

Weighted Covariance Matrices for Similarity Calculation

[0078] In calculating the similarity, some disclosed methods involve assigning weights to the contributions of each microphone and microphone pair to the similarity by integrating a weighting matrix $\mathbf{W}_c(\ell)$ into the calculation described in the similarity calculation section above. In some such examples, the similarity of the reference spatial covariance matrix and the spatial covariance matrix may be modified as follows:

In some alternative examples, the similarity of the reference spatial covariance matrix and the spatial covariance matrix may be modified as follows:

The choice of the weighting matrix $\mathbf{W}_c(\ell)$ may, in some examples, depend on a priori information, such as the geometry of a microphone array, the directivity of microphones, microphone occlusion detection, microphone failure detection, etc. Taking an array with three microphones as an example, if the distance between the 1st and 2nd microphones is small, the main lobe may be too wide to achieve enough resolution. The weights for this pair of microphones may be set to a small value (even equal to zero) to reduce its contribution to the similarity value.

[0079] If a microphone is omnidirectional, for example, the 1st microphone, the weight for $c_{1,1}$ can be set to zero since it contributes nothing to the similarity calculation.

[0080] If microphone arrays are on mobile devices that people use by holding them with their hands, it is very common that some microphones are occluded by hands. The results from the module identifying the occluded microphones can be applied as the inputs to derive the weighting matrix. For example, if the 1st microphone is occluded, we can set a mask vector $\mathbf{w}_{\mathrm{mask}} = [0, 1, 1]$ to exclude the contribution of the occluded microphone.
In some such examples, the weighting matrix can be given by:

[0081] If some microphones malfunction, we can also exclude their contributions to the similarity calculation, for example as described above with reference to occluded microphones.

Weighted Similarities for Direction-Of-Arrival Estimation

[0082] According to some disclosed methods, the similarity comparison is conducted in each bin. However, some bins may be more important than others for DOA estimation. For example, bins having a high signal-to-noise ratio (SNR) audio signal are normally more important than bins having a low SNR audio signal. The bins in which the directional sounds are dominant play an important role in estimating the directions of arrival. Accordingly, some disclosed methods involve generating a weighting vector, denoted herein by $W_f(k, \ell)$, to weight the similarity values along the frequency axis. In some examples, the vector may be scaled (for example, from 0 to 1), whereas in other examples the vector may be binary.

[0083] Various methods related to direct-diffuse extraction can be used as a basis for this weighting vector, such as directional sound extraction based on a statistical method or directional sound extraction via a neural network. One example is based on the statistical method described in the directional sound enhancement section above to create the weighting vector. In this example, the weighting vector may be expressed as follows:

where $\widehat{\mathrm{DDR}}(k, \ell)$ represents the estimated a priori DDR and $\mathrm{DDR}_{\mathrm{floor}}$ and $\mathrm{DDR}_{\mathrm{ceil}}$ represent floor and ceiling thresholds, respectively, of DDR.

[0084] According to some examples, the results from signal-to-noise ratio estimation and onset detection may be integrated into obtaining the weight factor to improve the robustness with regard to noise and reverberation.
If the time-frequency masks from audio object extraction modules to extract a specific object are available, we can integrate those masks into the calculation of $W_f(k, \ell)$ to force the disclosed method to focus on the extracted audio object. Some such examples involve combining the individual masks $W_{tfm}(k, \ell)$ into $W_f(k, \ell)$ as follows:

[0085] In the foregoing expression, the subscript tfm represents the time-frequency masks derived from DDR, SNR, onset detection, audio object extraction, and other modules that output time-frequency masks, which in some examples are scaled from 0 to 1.

Direction-Of-Arrival Estimation Based on Similarity

[0086] Some examples involve taking the angle with the largest similarity between the observed spatial covariance matrix and the reference covariance matrices as the estimated direction, which is denoted herein by $\hat{\theta}(\ell)$:

$$\hat{\theta}(\ell) = \arg\max_{\theta} \mathcal{L}(\ell, \theta)$$

$\mathcal{L}(\ell, \theta)$ represents the similarity between the observed spatial covariance matrix and the reference covariance matrices and may be obtained as follows:

In the foregoing expression, $\ell_{-}$ and $\ell_{+}$ represent the numbers of frames of past and future contributions to the estimation of the current frame. The choice of their values depends on the specific application. For a real-time application, if it requires low latency, $\ell_{+}$ should be very small or equal to zero. If the use case involves tracking a moving sound source, $\ell_{+}$ should be set according to this movement, for example according to the maximum speed of the sound source. For a non-real-time use case, for example in which all frames have been previously recorded, all frames may be used to make a prediction. In this case, for stationary sound sources we can obtain one estimation result from all frames.

Confidence Estimation

[0087] In addition to the direction-of-arrival estimation, some disclosed methods also output a confidence metric for each estimated DOA, for example in parallel.
The confidence metric indicates how much downstream tasks should trust the DOA estimation result.

[0088] Some examples involve defining a quantity to indicate the bias between the estimated direction of arrival of the current frame and the historical estimated directions of arrival. This bias is denoted herein by the symbol $\mathcal{B}(\ell)$ and may, in some examples, be scaled from 0 to 1:

$$\mathcal{B}(\ell) = \sum_{\ell'} \mathrm{win}(\ell')\,|\hat{\theta}(\ell) - \hat{\theta}(\ell - \ell')| \,/\, \mathcal{B}_{\max}$$

In the foregoing expression, $\mathcal{B}_{\max}$ represents the maximum bias of all directions that are being considered in the DOA estimation. In one example, if all directions that are being considered are evenly distributed in the horizontal plane, $\mathcal{B}_{\max}$ is equal to 180 degrees. In the foregoing expression, $\mathrm{win}(\ell')$ represents a weighting function. One example of generating the weighting function involves leveraging a symmetric Hanning window, as follows:

[0089] Since we have already provided examples of how to obtain $\Psi(k, \ell)$, the quantity to measure the diffuseness level, and the covariance matrix of the diffuse part $\hat{\mathbf{C}}_{\mathrm{diff}}(k, \ell)$ of the observed signals, we will now provide an example of how to combine both with the bias to estimate a confidence metric corresponding to the estimated direction.

[0090] As a starting point, this method involves obtaining the full-band versions of the diffuseness level and $\hat{\mathbf{C}}_{\mathrm{diff}}(k, \ell)$. One disclosed example involves accumulating, along the frequency axis, the outputs of the nonlinear processing applied to the smoothed version of $\Psi(k, \ell)$.

[0091] A full-band version of the a posteriori direct-to-diffuse ratio can be estimated by the following:

[0092] In some methods, the confidence level may be indicated by the bias. In some such methods, a small value of the bias quantity means the currently estimated direction of arrival aligns with the historical DOA estimations, which will lead to a high confidence level for the current DOA result.
The confidence metric is denoted herein by the symbol ℂ and may be expressed as follows:

[0093] The foregoing expression allows the confidence metric to be mapped to a range from zero to one. In the foregoing expression, a scale factor derived from the a posteriori DDR and diffuseness level is used to modify the initial estimation. According to some examples, the scale factor may be expressed as follows:

[0094] Figure 3 is a flow diagram that outlines another example of a method that may be performed by an apparatus or system such as those disclosed herein. The blocks of method 300, like other methods described herein, are not necessarily performed in the order indicated. In some implementations, one or more of the blocks of method 300 may be performed concurrently. Moreover, some implementations of method 300 may include more or fewer blocks than shown and/or described. The blocks of method 300 may be performed by one or more devices, which may be (or may include) a control system such as the control system 110 that is shown in Figure 1A and described above.

[0095] In this example, method 300 is a direction-of-arrival (DOA) estimation method. According to this example, block 305 involves receiving, by a control system that includes one or more processors, microphone signals obtained by a microphone array. The microphone signals may, in some examples, correspond with or include the inputs 140 that are shown in Figures 1B and 2.

[0096] According to this example, block 310 involves enhancing, by the control system, directional components of the microphone signals, to produce directionally-enhanced microphone signals. According to some examples, enhancing the directional components of the microphone signals may involve suppressing diffuse components of the microphone signals. In some examples, enhancing the directional components of the microphone signals may involve applying a time-frequency mask.
The time-frequency mask may, for example, be based on an estimated direct-to-diffuse ratio. According to some examples, the estimated direct-to-diffuse ratio may be estimated based on diffuse power tracking. In some such examples, the diffuse power tracking may be based, at least in part, on eigenvalue distributions of the spatial covariance matrices of the microphone signals. [0097] In this example, block 315 involves calculating, by the control system, spatial covariance matrices based on the directionally-enhanced microphone signals. According to this example, block 320 involves vectorizing, by the control system, the spatial covariance matrices to produce spatial covariance vectors. [0098] According to this example, block 325 involves determining, by the control system, reference spatial covariance matrices of impulse responses for a range of directions. According to some examples, the range of directions may be based on prior information. However, in some examples the range of directions is not based on prior information. In some examples, the range of directions may include 2π radians in at least one plane. In some examples, determining the reference spatial covariance matrices of impulse responses may be based, at least in part, on a geometry of the microphone array. According to some examples, determining the reference spatial covariance matrices of impulse responses may involve measuring an impulse response from a point source to each microphone of the microphone array. In some examples, determining the reference spatial covariance matrices of impulse responses may involve finite element modeling. According to this example, block 330 involves vectorizing, by the control system, the reference spatial covariance matrices to produce reference spatial covariance vectors. [0099] In this example, block 335 involves calculating, by the control system, similarities between the spatial covariance vectors and the reference spatial covariance vectors. 
According to some examples, the similarities between the spatial covariance vectors and the reference spatial covariance vectors may be calculated for each frequency band of a plurality of frequency bands. In some examples, the similarity in each frequency band may be calculated based on an inner product of a spatial covariance vector and the reference spatial covariance vectors. According to some examples, the similarity in each frequency band may be calculated based on distances between a spatial covariance vector and the reference spatial covariance vectors. [0100] According to this example, block 340 involves estimating, by the control system, one or more sound source directions of arrival based on the similarities. Various examples of block 340 are disclosed herein. [0101] In some examples, method 300 also may involve estimating, by the control system, a confidence metric for each estimated sound source direction of arrival. The confidence metric may be based, at least in part, on estimated direction-of-arrival and direct-to-diffuse ratio. [0102] According to some examples, method 300 also may involve determining one or more weighting factors for one or more components of the spatial covariance vectors and applying the one or more weighting factors to the one or more components. The one or more weighting factors may be based, at least in part, on inner products of, or distances between, a spatial covariance vector and the reference spatial covariance vectors. In some examples, determining the one or more weighting factors may be based on microphone anomaly detection, microphone occlusion, microphone array geometry, microphone directivity, or combinations thereof. According to some examples, the one or more weighting factors may be determined on a per-bin basis, such as a per-FFT-bin basis. 
Some such methods also may involve determining, by the control system, a weighted full-band similarity by accumulating a similarity for each frequency bin of a plurality of frequency bins. In some examples, an estimated sound source direction of arrival may be based, at least in part, on the weighted full-band similarity. According to some examples, determining the one or more weighting factors on the per-FFT-bin basis may involve applying one or more time-frequency masks. The one or more time-frequency masks may, for example, be based on a direct-to-diffuse ratio, a signal-to-noise ratio, onset detection for extracting a direct portion of a sound source signal, an audio object extraction algorithm, a wind noise detection, a voiced/unvoiced classification or combinations thereof.

Example IVAS Codec Framework

[0103] Figure 4 is a block diagram of an immersive voice and audio services (IVAS) coder/decoder (“codec”) framework 400 for encoding and decoding IVAS bitstreams, according to one or more embodiments. IVAS is expected to support a range of audio service capabilities, including but not limited to mono to stereo upmixing and fully immersive audio encoding, decoding and rendering. IVAS is also intended to be supported by a wide range of devices, endpoints, and network nodes, including but not limited to: mobile and smart phones, electronic tablets, personal computers, conference phones, conference rooms, virtual reality (VR) and augmented reality (AR) devices, home theatre devices, and other suitable devices.

[0104] In this example, the IVAS codec 400 includes IVAS encoder 401 and IVAS decoder 404. In some examples, the IVAS encoder 401, the IVAS decoder 404, or both, may be implemented by one or more instances of the control system 110 of Figure 1A.
According to some examples, a control system that implements the IVAS encoder 401, the IVAS decoder 404, or both, also may be configured to perform some or all of the operations disclosed herein, such as the methods that are described with reference to one or more of Figures 1A–3. Alternatively, or additionally, a device that implements one or more disclosed methods may be used to process data that is provided to a control system that implements the IVAS encoder 401, the IVAS decoder 404, or both. In some such examples, one or more of the disclosed DOA estimation methods may be implemented as a part of an acoustic scene analysis module in an Ambisonic upmixer that is configured to transform raw microphone signals (for example, microphone signals from a smartphone, which may be 3-channel signals) into signals in Ambisonic format that are provided to the IVAS encoder 401. The Ambisonic upmixer may be implemented by an instance of the apparatus 101 of Figure 1A, or by a component of the apparatus 101 (such as the control system 110). In some implementations, the estimated DOA, estimated confidence values, or both, may be used as spatial metadata for the IVAS codec 400. [0105] According to this example, the IVAS encoder 401 includes spatial encoder 402 that receives N channels of input spatial audio (e.g., FOA, HOA). In some implementations, spatial encoder 402 may be configured to implement Spatial Reconstruction (SPAR), Directional Audio Coding (DirAC), another spatial audio coding technology, or combinations thereof. In this example, the output of spatial encoder 402 includes a spatial metadata (MD) bitstream (BS) and N_dmx channels of spatial downmix. According to this example, the spatial MD is quantized and entropy coded. In some implementations, quantization can include fine, moderate, coarse and extra coarse quantization strategies and entropy coding can include Huffman or Arithmetic coding. 
In some implementations, the framework may permit not more than 3 levels of quantization at a given operating mode; however, with decreasing bitrates, in some such implementations the three levels become increasingly coarser overall, to meet bitrate requirements. According to this example, the core audio encoder 403 (which may, for example, be based on a mono Enhanced Voice Services (EVS) encoding unit) is configured to encode N_dmx channels (N_dmx = 1-16 channels) of the spatial downmix into an audio bitstream, which is combined with the spatial MD bitstream into an IVAS encoded bitstream transmitted to IVAS decoder 404.

[0106] In this example, the IVAS decoder 404 includes core audio decoder 405 (e.g., EVS decoder) that decodes the audio bitstream extracted from the IVAS bitstream to recover the N_dmx audio channels. According to this example, the spatial decoder/renderer 406 (e.g., SPAR/DirAC) decodes the spatial MD bitstream extracted from the IVAS bitstream to recover the spatial MD, and synthesizes/renders output audio channels using the spatial MD and a spatial upmix for playback on various audio systems with different speaker configurations and capabilities.

[0107] Some aspects of the present disclosure include a system or device configured (e.g., programmed) to perform one or more examples of the disclosed methods, and a tangible computer readable medium (e.g., a disc) which stores code for implementing one or more examples of the disclosed methods or steps thereof. For example, some disclosed systems can be or include a programmable general purpose processor, digital signal processor, or microprocessor, programmed with software or firmware and/or otherwise configured to perform any of a variety of operations on data, including an embodiment of disclosed methods or steps thereof.
Such a general purpose processor may be or include a computer system including an input device, a memory, and a processing subsystem that is programmed (and/or otherwise configured) to perform one or more examples of the disclosed methods (or steps thereof) in response to data asserted thereto.

[0108] Some embodiments may be implemented as a configurable (e.g., programmable) digital signal processor (DSP) that is configured (e.g., programmed and otherwise configured) to perform required processing on audio signal(s), including performance of one or more examples of the disclosed methods. Alternatively, embodiments of the disclosed systems (or elements thereof) may be implemented as a general purpose processor (e.g., a personal computer (PC) or other computer system or microprocessor, which may include an input device and a memory) which is programmed with software or firmware and/or otherwise configured to perform any of a variety of operations including one or more examples of the disclosed methods. Alternatively, elements of some embodiments of the inventive system are implemented as a general purpose processor or DSP configured (e.g., programmed) to perform one or more examples of the disclosed methods, and the system also includes other elements (e.g., one or more loudspeakers and/or one or more microphones). A general purpose processor configured to perform one or more examples of the disclosed methods may be coupled to an input device (e.g., a mouse and/or a keyboard), a memory, and a display device.

[0109] Another aspect of the present disclosure is a computer readable medium (for example, a disc or other tangible storage medium) which stores code for performing (e.g., code executable to perform) one or more examples of the disclosed methods or steps thereof.
[0110] While specific embodiments and applications of the disclosure have been described herein, it will be apparent to those of ordinary skill in the art that many variations on the embodiments and applications described herein are possible without departing from the scope of this disclosure.

[0111] Various aspects of the present disclosure may be appreciated from the following Enumerated Example Embodiments (EEEs):

EEE1. A direction-of-arrival (DOA) estimation method, comprising: receiving, by a control system, microphone signals obtained by a microphone array; enhancing, by the control system, directional components of the microphone signals, to produce directionally-enhanced microphone signals; calculating, by the control system, spatial covariance matrices based on the directionally-enhanced microphone signals; vectorizing, by the control system, the spatial covariance matrices to produce spatial covariance vectors; determining, by the control system, reference spatial covariance matrices of impulse responses for a range of directions; vectorizing, by the control system, the reference spatial covariance matrices to produce reference spatial covariance vectors; calculating, by the control system, similarities between the spatial covariance vectors and the reference spatial covariance vectors; and estimating one or more sound source directions of arrival based on the similarities.

EEE2. The method of EEE1, wherein the range of directions is based on prior information.

EEE3. The method of EEE1, wherein the range of directions is not based on prior information and wherein the range of directions includes 2π radians in at least one plane.

EEE4. The method of any one of EEE1–EEE3, wherein enhancing the directional components of the microphone signals involves suppressing diffuse components of the microphone signals.

EEE5.
The method of any one of EEE1–EEE4, wherein enhancing the directional components of the microphone signals involves applying a time-frequency mask.

EEE6. The method of EEE5, wherein the time-frequency mask is based on an estimated direct-to-diffuse ratio.

EEE7. The method of EEE6, wherein the estimated direct-to-diffuse ratio is estimated based on diffuse power tracking.

EEE8. The method of EEE7, wherein the diffuse power tracking is based, at least in part, on eigenvalue distributions of the spatial covariance matrices of the microphone signals.

EEE9. The method of any one of EEE1–EEE8, wherein the similarities between the spatial covariance vectors and the reference spatial covariance vectors are calculated for each frequency band of a plurality of frequency bands.

EEE10. The method of EEE9, wherein the similarity in each frequency band is calculated based on an inner product of a spatial covariance vector and the reference spatial covariance vectors.

EEE11. The method of EEE9, wherein the similarity in each frequency band is calculated based on distances between a spatial covariance vector and the reference spatial covariance vectors.

EEE12. The method of any one of EEE1–EEE11, further comprising estimating, by the control system, a confidence metric for each estimated sound source direction of arrival.

EEE13. The method of EEE12, wherein the confidence metric is based, at least in part, on estimated direction-of-arrival and direct-to-diffuse ratio.

EEE14. The method of any one of EEE1–EEE13, wherein determining the reference spatial covariance matrices of impulse responses is based, at least in part, on a geometry of the microphone array.

EEE15. The method of any one of EEE1–EEE13, wherein determining the reference spatial covariance matrices of impulse responses involves measuring an impulse response from a point source to each microphone of the microphone array.

EEE16.
The method of any one of EEE1–EEE13, wherein determining the reference spatial covariance matrices of impulse responses involves finite element modeling.

EEE17. The method of any one of EEE1–EEE16, further comprising determining one or more weighting factors for one or more components of the spatial covariance vectors and applying the one or more weighting factors to the one or more components.

EEE18. The method of EEE17, wherein the one or more weighting factors are based, at least in part, on inner products of, or distances between, a spatial covariance vector and the reference spatial covariance vectors.

EEE19. The method of EEE17 or EEE18, wherein determining the one or more weighting factors is based on microphone anomaly detection, microphone occlusion, microphone array geometry, microphone directivity, or combinations thereof.

EEE20. The method of EEE17 or EEE19, wherein the one or more weighting factors are determined on a per-fast-Fourier-transform-bin (per-FFT-bin) basis, further comprising determining, by the control system, a weighted full-band similarity by accumulating a similarity for each frequency bin of a plurality of frequency bins.

EEE21. The method of EEE20, wherein an estimated sound source direction of arrival is based on the weighted full-band similarity.

EEE22. The method of EEE20 or EEE21, wherein determining the one or more weighting factors on the per-FFT-bin basis involves applying one or more time-frequency masks and wherein the one or more time-frequency masks are based on a direct-to-diffuse ratio, a signal-to-noise ratio, onset detection for extracting a direct portion of a sound source signal, an audio object extraction algorithm, a wind noise detection, a voiced/unvoiced classification or combinations thereof.

EEE23. An apparatus configured to perform the method of any one of EEE1–EEE22.

EEE24. A system configured to perform the method of any one of EEE1–EEE22.

EEE25.
One or more non-transitory, computer-readable media having instructions stored thereon for controlling one or more devices to perform the method of any one of EEE1–EEE22.
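
By way of illustration only, the masking and per-bin weighting recited in the enumerated example embodiments above may be sketched in Python as follows. This is a minimal sketch under stated assumptions and is not the disclosed implementation: it assumes a two-microphone array, models the diffuse field as spatially white (so that the spatial covariance matrix is a rank-one direct term plus a scaled identity), and uses a Wiener-like mapping from the direct-to-diffuse ratio to a mask value. The function names `eig_2x2_hermitian`, `direct_to_diffuse_ratio`, `mask_from_ddr` and `weighted_full_band_similarity` are hypothetical and are introduced only for this example.

```python
import math


def eig_2x2_hermitian(scm):
    """Closed-form eigenvalues (largest first) of a 2x2 Hermitian matrix."""
    a = scm[0][0].real
    d = scm[1][1].real
    b = scm[0][1]
    mean = 0.5 * (a + d)
    radius = math.sqrt((0.5 * (a - d)) ** 2 + abs(b) ** 2)
    return mean + radius, mean - radius


def direct_to_diffuse_ratio(scm, floor=1e-12):
    """Estimate the DDR from the eigenvalue spread of a 2-mic covariance matrix.

    Assumption (not from the disclosure): SCM = p_direct * a a^H + p_diffuse * I
    with unit-modulus steering entries, so the eigenvalues are
    2 * p_direct + p_diffuse and p_diffuse.
    """
    lam_max, lam_min = eig_2x2_hermitian(scm)
    diffuse = max(lam_min, floor)
    direct = max(lam_max - lam_min, 0.0) / 2.0
    return direct / diffuse


def mask_from_ddr(ddr):
    """Wiener-like gain in [0, 1); an assumed mapping chosen for illustration."""
    return ddr / (1.0 + ddr)


def weighted_full_band_similarity(similarities, weights):
    """Accumulate per-bin similarities with per-bin weights, normalized by weight sum."""
    total_weight = sum(weights)
    if total_weight == 0.0:
        return 0.0
    return sum(w * s for w, s in zip(weights, similarities)) / total_weight


# One bin dominated by a direct source, one bin that is purely diffuse.
scm_direct = [[3.0 + 0j, 2.0 + 0j], [2.0 + 0j, 3.0 + 0j]]
scm_diffuse = [[1.0 + 0j, 0j], [0j, 1.0 + 0j]]
weights = [mask_from_ddr(direct_to_diffuse_ratio(m)) for m in (scm_direct, scm_diffuse)]
full_band = weighted_full_band_similarity([0.9, 0.1], weights)
```

In this sketch the purely diffuse bin receives zero weight, so only the bin dominated by the direct sound contributes to the full-band similarity. A practical system would likely track diffuse power over time and handle arrays with more than two microphones, neither of which is shown here.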

Claims

CLAIMS

What Is Claimed Is:

1. A direction-of-arrival (DOA) estimation method, comprising: receiving, by a control system, microphone signals obtained by a microphone array; enhancing, by the control system, directional components of the microphone signals, to produce directionally-enhanced microphone signals; calculating, by the control system, spatial covariance matrices based on the directionally-enhanced microphone signals; vectorizing, by the control system, the spatial covariance matrices to produce spatial covariance vectors; determining, by the control system, reference spatial covariance matrices of impulse responses for a range of directions; vectorizing, by the control system, the reference spatial covariance matrices to produce reference spatial covariance vectors; calculating, by the control system, similarities between the spatial covariance vectors and the reference spatial covariance vectors; and estimating one or more sound source directions of arrival based on the similarities.

2. The method of claim 1, wherein the range of directions is based on prior information.

3. The method of claim 1, wherein the range of directions is not based on prior information and wherein the range of directions includes 2π radians in at least one plane.

4. The method of any one of claims 1–3, wherein enhancing the directional components of the microphone signals involves suppressing diffuse components of the microphone signals.

5. The method of any one of claims 1–4, wherein enhancing the directional components of the microphone signals involves applying a time-frequency mask.

6. The method of claim 5, wherein the time-frequency mask is based on an estimated direct-to-diffuse ratio.

7. The method of claim 6, wherein the estimated direct-to-diffuse ratio is estimated based on diffuse power tracking.

8. The method of claim 7, wherein the diffuse power tracking is based, at least in part, on eigenvalue distributions of the spatial covariance matrices of the microphone signals.

9. The method of any one of claims 1–8, wherein the similarities between the spatial covariance vectors and the reference spatial covariance vectors are calculated for each frequency band of a plurality of frequency bands.

10. The method of claim 9, wherein the similarity in each frequency band is calculated based on an inner product of a spatial covariance vector and the reference spatial covariance vectors.

11. The method of claim 9, wherein the similarity in each frequency band is calculated based on distances between a spatial covariance vector and the reference spatial covariance vectors.

12. The method of any one of claims 1–11, further comprising estimating, by the control system, a confidence metric for each estimated sound source direction of arrival.

13. The method of claim 12, wherein the confidence metric is based, at least in part, on estimated direction-of-arrival and direct-to-diffuse ratio.

14. The method of any one of claims 1–13, wherein determining the reference spatial covariance matrices of impulse responses is based, at least in part, on a geometry of the microphone array.

15. The method of any one of claims 1–13, wherein determining the reference spatial covariance matrices of impulse responses involves measuring an impulse response from a point source to each microphone of the microphone array.

16. The method of any one of claims 1–13, wherein determining the reference spatial covariance matrices of impulse responses involves finite element modeling.

17. The method of any one of claims 1–16, further comprising determining one or more weighting factors for one or more components of the spatial covariance vectors and applying the one or more weighting factors to the one or more components.

18. The method of claim 17, wherein the one or more weighting factors are based, at least in part, on inner products of, or distances between, a spatial covariance vector and the reference spatial covariance vectors.

19. The method of claim 17 or claim 18, wherein determining the one or more weighting factors is based on microphone anomaly detection, microphone occlusion, microphone array geometry, microphone directivity, or combinations thereof.

20. The method of claim 17 or claim 19, wherein the one or more weighting factors are determined on a per-fast-Fourier-transform-bin (per-FFT-bin) basis, further comprising determining, by the control system, a weighted full-band similarity by accumulating a similarity for each frequency bin of a plurality of frequency bins.

21. The method of claim 20, wherein an estimated sound source direction of arrival is based on the weighted full-band similarity.

22. The method of claim 20 or claim 21, wherein determining the one or more weighting factors on the per-FFT-bin basis involves applying one or more time-frequency masks and wherein the one or more time-frequency masks are based on a direct-to-diffuse ratio, a signal-to-noise ratio, onset detection for extracting a direct portion of a sound source signal, an audio object extraction algorithm, a wind noise detection, a voiced/unvoiced classification or combinations thereof.

23. An apparatus configured to perform the method of any one of claims 1–22.

24. A system configured to perform the method of any one of claims 1–22.

25. One or more non-transitory, computer-readable media having instructions stored thereon for controlling one or more devices to perform the method of any one of claims 1–22.
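
For illustration only, the covariance, vectorization, similarity and estimation steps recited in claim 1 may be sketched in Python under simplifying assumptions. This is a minimal sketch and not the claimed implementation: the directional enhancement step is omitted (the input frames are taken to be already directionally enhanced), the reference spatial covariance matrices are derived from an assumed free-field plane-wave model of a linear array rather than from measured or simulated impulse responses, the similarity is a normalized inner product, and per-bin similarities are accumulated with equal weights. All function names, the array geometry, the speed of sound and the test frequencies are hypothetical values chosen for the example.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed value for illustration


def steering_vector(mic_x, angle_deg, freq_hz):
    """Assumed free-field plane-wave response of a linear array along the x axis."""
    theta = math.radians(angle_deg)
    return [
        cmath.exp(-2j * math.pi * freq_hz * x * math.cos(theta) / SPEED_OF_SOUND)
        for x in mic_x
    ]


def spatial_covariance(frames):
    """Average of the outer products x x^H over frames (one complex value per mic)."""
    m = len(frames[0])
    scm = [[0j] * m for _ in range(m)]
    for x in frames:
        for i in range(m):
            for j in range(m):
                scm[i][j] += x[i] * x[j].conjugate() / len(frames)
    return scm


def vectorize(matrix):
    """Flatten a matrix row by row into a spatial covariance vector."""
    return [value for row in matrix for value in row]


def similarity(u, v):
    """Magnitude of the complex inner product, normalized to the range [0, 1]."""
    inner = sum(a * b.conjugate() for a, b in zip(u, v))
    norm = math.sqrt(sum(abs(a) ** 2 for a in u) * sum(abs(b) ** 2 for b in v))
    return abs(inner) / norm if norm > 0.0 else 0.0


def estimate_doa(frames_per_bin, mic_x, freqs_hz, candidate_angles):
    """Return the candidate angle with the highest similarity summed over bins."""
    observed = [vectorize(spatial_covariance(frames)) for frames in frames_per_bin]
    scores = []
    for angle in candidate_angles:
        total = 0.0
        for obs, freq in zip(observed, freqs_hz):
            ref_scm = spatial_covariance([steering_vector(mic_x, angle, freq)])
            total += similarity(obs, vectorize(ref_scm))
        scores.append(total)
    best = max(range(len(scores)), key=scores.__getitem__)
    return candidate_angles[best], scores


# Synthetic, noise-free frames from a single source at 60 degrees.
mic_x = [0.0, 0.04, 0.08, 0.12]
freqs_hz = [1000.0, 2000.0, 3000.0]
frames_per_bin = [
    [[cmath.exp(1j * k) * a for a in steering_vector(mic_x, 60.0, f)] for k in range(8)]
    for f in freqs_hz
]
angle, scores = estimate_doa(frames_per_bin, mic_x, freqs_hz, list(range(0, 181, 5)))
print(angle)
```

The reference vectors depend only on the array model and the candidate directions, so in practice they could be computed once and cached rather than rebuilt for every estimate as in this sketch.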