EP4668265A1 - Audio apparatus and method of operation therefor - Google Patents

Audio apparatus and method of operation therefor

Info

Publication number: EP4668265A1
Authority: EP; European Patent Office
Prior art keywords: audio; noise; signal; adaptive; beamformed
Prior art date: 2024-06-18
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.): Pending

Application number

EP24182816.9A

Other languages

German (de)

English (en)

Inventor

Brian Brand Antonius Johannes Bloemendal

Cornelis Pieter Janse

Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)

Koninklijke Philips NV

Original Assignee

Koninklijke Philips NV

Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)

2024-06-18

Filing date

2024-06-18

Publication date

2025-12-24

2024-06-18: Application filed by Koninklijke Philips NV

2024-06-18: Priority to EP24182816.9A

2025-06-09: Priority to PCT/EP2025/065962 (WO2025261811A1)

2025-12-24: Publication of EP4668265A1

Status: Pending

Classifications

- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10K—SOUND-PRODUCING DEVICES; METHODS OR DEVICES FOR PROTECTING AGAINST, OR FOR DAMPING, NOISE OR OTHER ACOUSTIC WAVES IN GENERAL; ACOUSTICS NOT OTHERWISE PROVIDED FOR
- G10K11/00—Methods or devices for transmitting, conducting or directing sound in general; Methods or devices for protecting against, or for damping, noise or other acoustic waves in general
- G10K11/16—Methods or devices for protecting against, or for damping, noise or other acoustic waves in general
- G10K11/175—Methods or devices for protecting against, or for damping, noise or other acoustic waves in general using interference effects; Masking sound
- G10K11/178—Methods or devices for protecting against, or for damping, noise or other acoustic waves in general using interference effects; Masking sound by electro-acoustically regenerating the original acoustic waves in anti-phase
- G10K11/1781—Methods or devices for protecting against, or for damping, noise or other acoustic waves in general using interference effects; Masking sound by electro-acoustically regenerating the original acoustic waves in anti-phase characterised by the analysis of input or output signals, e.g. frequency range, modes, transfer functions
- G10K11/17821—Methods or devices for protecting against, or for damping, noise or other acoustic waves in general using interference effects; Masking sound by electro-acoustically regenerating the original acoustic waves in anti-phase characterised by the analysis of input or output signals, e.g. frequency range, modes, transfer functions characterised by the analysis of the input signals only
- G10K11/17823—Reference signals, e.g. ambient acoustic environment
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10K—SOUND-PRODUCING DEVICES; METHODS OR DEVICES FOR PROTECTING AGAINST, OR FOR DAMPING, NOISE OR OTHER ACOUSTIC WAVES IN GENERAL; ACOUSTICS NOT OTHERWISE PROVIDED FOR
- G10K11/00—Methods or devices for transmitting, conducting or directing sound in general; Methods or devices for protecting against, or for damping, noise or other acoustic waves in general
- G10K11/16—Methods or devices for protecting against, or for damping, noise or other acoustic waves in general
- G10K11/175—Methods or devices for protecting against, or for damping, noise or other acoustic waves in general using interference effects; Masking sound
- G10K11/178—Methods or devices for protecting against, or for damping, noise or other acoustic waves in general using interference effects; Masking sound by electro-acoustically regenerating the original acoustic waves in anti-phase
- G10K11/1785—Methods, e.g. algorithms; Devices
- G10K11/17853—Methods, e.g. algorithms; Devices of the filter
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10K—SOUND-PRODUCING DEVICES; METHODS OR DEVICES FOR PROTECTING AGAINST, OR FOR DAMPING, NOISE OR OTHER ACOUSTIC WAVES IN GENERAL; ACOUSTICS NOT OTHERWISE PROVIDED FOR
- G10K11/00—Methods or devices for transmitting, conducting or directing sound in general; Methods or devices for protecting against, or for damping, noise or other acoustic waves in general
- G10K11/16—Methods or devices for protecting against, or for damping, noise or other acoustic waves in general
- G10K11/175—Methods or devices for protecting against, or for damping, noise or other acoustic waves in general using interference effects; Masking sound
- G10K11/178—Methods or devices for protecting against, or for damping, noise or other acoustic waves in general using interference effects; Masking sound by electro-acoustically regenerating the original acoustic waves in anti-phase
- G10K11/1787—General system configurations
- G10K11/17879—General system configurations using both a reference signal and an error signal
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10K—SOUND-PRODUCING DEVICES; METHODS OR DEVICES FOR PROTECTING AGAINST, OR FOR DAMPING, NOISE OR OTHER ACOUSTIC WAVES IN GENERAL; ACOUSTICS NOT OTHERWISE PROVIDED FOR
- G10K11/00—Methods or devices for transmitting, conducting or directing sound in general; Methods or devices for protecting against, or for damping, noise or other acoustic waves in general
- G10K11/18—Methods or devices for transmitting, conducting or directing sound
- G10K11/26—Sound-focusing or directing, e.g. scanning
- G10K11/34—Sound-focusing or directing, e.g. scanning using electrical steering of transducer arrays, e.g. beam steering
- G10K11/341—Circuits therefor
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; ELECTRIC HEARING AIDS; PUBLIC ADDRESS SYSTEMS
- H04R3/00—Circuits for transducers
- H04R3/005—Circuits for transducers for combining the signals of two or more microphones
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10K—SOUND-PRODUCING DEVICES; METHODS OR DEVICES FOR PROTECTING AGAINST, OR FOR DAMPING, NOISE OR OTHER ACOUSTIC WAVES IN GENERAL; ACOUSTICS NOT OTHERWISE PROVIDED FOR
- G10K2210/00—Details of active noise control [ANC] covered by G10K11/178 but not provided for in any of its subgroups
- G10K2210/30—Means
- G10K2210/301—Computational
- G10K2210/3012—Algorithms
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10K—SOUND-PRODUCING DEVICES; METHODS OR DEVICES FOR PROTECTING AGAINST, OR FOR DAMPING, NOISE OR OTHER ACOUSTIC WAVES IN GENERAL; ACOUSTICS NOT OTHERWISE PROVIDED FOR
- G10K2210/00—Details of active noise control [ANC] covered by G10K11/178 but not provided for in any of its subgroups
- G10K2210/30—Means
- G10K2210/301—Computational
- G10K2210/3038—Neural networks
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10K—SOUND-PRODUCING DEVICES; METHODS OR DEVICES FOR PROTECTING AGAINST, OR FOR DAMPING, NOISE OR OTHER ACOUSTIC WAVES IN GENERAL; ACOUSTICS NOT OTHERWISE PROVIDED FOR
- G10K2210/00—Details of active noise control [ANC] covered by G10K11/178 but not provided for in any of its subgroups
- G10K2210/30—Means
- G10K2210/301—Computational
- G10K2210/3046—Multiple acoustic inputs, multiple acoustic outputs
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; ELECTRIC HEARING AIDS; PUBLIC ADDRESS SYSTEMS
- H04R2430/00—Signal processing covered by H04R, not provided for in its groups
- H04R2430/20—Processing of the output signals of the acoustic transducers of an array for obtaining a desired directivity characteristic

Definitions

The invention relates to generation of an audio signal from a plurality of audio signals and in particular, but not exclusively, to generation of an audio signal capturing audio, such as speech, in an audio environment.
A problem in many scenarios and applications is that the desired audio source is typically not the only audio source in the environment. Rather, in typical audio environments there are many other audio/noise sources which are being captured by the microphone. Audio processing is often used to improve the capture of audio, and in particular to post-process the captured audio to improve the resulting audio signals.
Audio is captured by a plurality of microphones at different positions.
A linear array of a plurality of microphones is often used to capture audio in an environment, such as in a room.
The use of multiple microphones allows spatial information of the audio to be captured, and applications have been developed that exploit such spatial information, allowing improved and/or new services.
One frequently used approach is to try to separate audio sources by applying audio beamforming to form beams in the direction of arrival of audio from specific audio sources.
Although this may provide advantageous performance in many scenarios, it is not optimal in all cases.
It may not provide optimal source separation in some cases, and in some applications such spatial beamforming may not provide audio properties that are ideal for further processing to achieve a given effect.
Traditional capture and processing may result in an audio signal including undesired amounts of reverberation, which may be disadvantageous for some applications, especially for many speech applications.
In many applications there is a strong desire for the audio capture to be able to adapt to the specific audio scene and sources, including for example adapting to changes in who is currently speaking or to changes in the positions of the current speaker(s). It is also desired that speech capture is resilient to other audio and noise being present in the scene. For example, it is often desirable for a capture system to be able to capture a desired speaker even in the presence of another strong interfering audio source, such as a noise source.
An improved approach would be advantageous, and in particular an approach allowing reduced complexity, increased flexibility, facilitated implementation, reduced cost, improved audio capture, improved spatial perception/differentiation of audio sources, improved audio source separation, improved audio/speech application support, reduced dependency on known or static acoustic properties, improved flexibility and customization to different audio environments and scenarios, improved audio beamforming, improved resilience to noise and unwanted audio sources in the audio scene, improved reverberation performance, an improved trade-off between performance and complexity/resource usage, and/or improved performance.
The invention seeks to mitigate, alleviate or eliminate one or more of the above-mentioned disadvantages, singly or in any combination.
an audio apparatus comprising: a receiver arranged to receive a plurality of audio signals; a plurality of noise cancelling beamformers each receiving the plurality of audio signals and generating a beamformed audio signal, each noise cancelling beamformer comprising: a beamform circuit arranged to generate a beamformed signal capturing audio in a beam and a noise reference signal capturing audio outside the beam, an adaptive filter arranged to filter the noise reference signal to generate a filtered noise signal, and a compensation circuit arranged to generate the beamformed audio signal by removing the filtered noise signal from the beamformed signal; an adaptive beamformer arranged to receive the beamformed audio signals from the plurality of noise cancelling beamformers and to generate an output audio signal by applying a beamform operation to the beamformed audio signals; an audio source detector arranged to detect a presence of a desired audio source in the plurality of audio signals; an adaptation circuit arranged to adapt the adaptive beamformer and the adaptive filters, the adaptation circuit being arranged to adapt the adaptive beamformer when the
The invention may provide improved audio capture in many embodiments.
Improved performance in many noisy environments may often be achieved.
The approach may in particular provide improved speech capture in many challenging audio environments.
The approach may provide an audio capturing apparatus having reduced sensitivity to e.g. noise, reverberation, and reflections.
Improved capture of audio sources in the far field can often be achieved.
The approach may in many embodiments provide improved noise resistance/noise reduction/suppression.
The approach may in many embodiments achieve reduced reverberation in the output audio signal (in particular of reverberation resulting from a source being captured by a plurality of the noise cancelling beamformers).
The approach may in many scenarios provide an improved trade-off between noise reduction and reverberation reduction.
The approach may typically provide improved capture of desired audio in the presence of noise, and potentially in the presence of a strong interferer/noise source.
The approach may in many scenarios provide improved extraction/separation/isolation of desired audio and/or may reduce the impact of a strong undesired audio source.
The synergistic effect between the different beamformers may in many cases result in improved audio capture and in particular improved noise performance and/or reduced reverberation.
The approach may allow improved attenuation of a noise audio source and allow improved extraction of desired audio sources, such as specifically speech sources.
The audio source detector may be arranged to detect a presence of a desired audio source in the plurality of audio signals by a processing/evaluation of the plurality of audio signals, and/or by analyzing a signal determined from the plurality of audio signals, such as one or more of the beamformed audio signals and/or the output audio signal.
The audio apparatus may comprise a noise source detector arranged to detect a presence of a noise source and an output circuit arranged to generate an output signal, the output circuit being arranged to switch between generating the output signal from the output audio signal of the adaptive beamformer when the presence of a noise source is detected and generating the output signal from at least one of the beamformed audio signals when the presence of a noise source is not detected.
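As an illustration only, the output switching described above might be sketched as follows. The function name is hypothetical, and the rule for choosing among the beamformed audio signals (highest energy) is an assumption; the text only requires that at least one of them is used when no noise source is detected.

```python
def select_output(noise_detected, adaptive_output, beamformed_signals):
    """Use the second-stage output when noise is detected, else a first-stage beam."""
    if noise_detected:
        return adaptive_output
    # Assumption: fall back to the beamformed signal with the highest energy.
    return max(beamformed_signals, key=lambda s: sum(v * v for v in s))


beams = [[0.1, -0.1, 0.1], [0.5, -0.4, 0.6]]   # toy first-stage outputs
second_stage = [0.3, -0.2, 0.3]                # toy adaptive beamformer output
with_noise = select_output(True, second_stage, beams)
without_noise = select_output(False, second_stage, beams)
```

Here `with_noise` is the adaptive beamformer output and `without_noise` is the stronger of the two beams.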
The noise cancelling beamformers may specifically be Generalized Sidelobe Canceller beamformers.
The adaptive beamformer may be arranged to generate the output signal by combining the beamformed audio signals, the combination including an adaptive filtering/weighting of each beamformed audio signal.
The adaptation circuit may be arranged to adapt the adaptive beamformer in response to the detection of the presence of the desired audio source and to inhibit/prevent adaptation of the adaptive filters in response to the detection of the presence of the desired audio source.
The compensation circuit may be a subtractor arranged to generate the beamformed audio signal by subtracting the filtered noise signal from the beamformed signal.
The audio apparatus further comprises a noise source detector arranged to detect a presence of a noise source and wherein the adaptation circuit is arranged to adapt the adaptive filters when the noise source is detected and the presence of the desired audio source is not detected.
This may provide improved performance and/or operation in many embodiments. It may typically allow improved separation and extraction of desired audio (e.g. speech audio sources) in the presence of strong interference/noise.
The adaptation circuit is arranged to adapt only one of the adaptive filters and the adaptive beamformer at any time.
The adaptation circuit may be arranged to prevent a simultaneous update of the adaptive filters and the adaptive beamformer. For example, for a given time interval/segment either the adaptive filters or the adaptive beamformer (or none of these) may be adapted but not both.
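The adaptation rules stated above (adapt the adaptive beamformer when the desired source is detected, adapt the adaptive filters only when a noise source is detected and the desired source is not, and never adapt both in the same interval) can be summarized in a small decision function. This is a minimal sketch; the function name and the string labels are hypothetical.

```python
def select_adaptation(desired_present, noise_present):
    """Return which stage may adapt in this time interval, or None for no adaptation."""
    if desired_present:
        # The adaptive filters stay frozen while the desired source is active.
        return "adaptive_beamformer"
    if noise_present:
        return "adaptive_filters"
    return None


decisions = {
    (d, n): select_adaptation(d, n)
    for d in (False, True)
    for n in (False, True)
}
```

Because the function returns a single label, a simultaneous update of both stages cannot occur.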
The adaptation circuit is arranged to adapt the adaptive beamformer to increase a signal level of the output audio signal.
The adaptation circuit may be arranged to adapt the adaptive beamformer to seek to maximize a signal level of the output audio signal.
A signal level may be an envelope/power/amplitude/energy measure/estimate.
The adaptation circuit is arranged to adapt the adaptive filter of a first noise cancelling beamformer to reduce a signal level of a beamformed audio signal of the first noise cancelling beamformer.
The adaptation circuit may be arranged to adapt the adaptive filter of the noise cancelling beamformer to seek to minimize a signal level of the beamformed audio signal output by that noise cancelling beamformer.
A signal level may be an envelope/power/amplitude/energy measure/estimate.
The beamform operation comprises generating the output audio signal from a combination of the beamformed audio signals.
This may provide improved performance and/or operation in many embodiments.
The audio apparatus further comprises a speech recognition circuit arranged to perform automatic speech recognition on the output audio signal.
The approach may provide a particularly suitable signal for automatic speech recognition.
Improved automatic speech recognition can be achieved.
The audio apparatus further comprises a communication circuit arranged to communicate the output audio signal to a remote destination and a selector arranged to switch the output audio signal between being generated by applying a beamform operation to the beamformed audio signals and being generated as a beamformed audio signal of one of the plurality of noise cancelling beamformers, in response to a selection between the communication circuit and the speech recognition circuit being active. The communication circuit and the speech recognition circuit can also both be active simultaneously.
This may provide improved performance and/or operation in many embodiments.
The beamform operation comprises combining the beamformed audio signals using different combination parameters in different frequency intervals.
The adaptation circuit is arranged to individually adapt the combination parameters in different frequency intervals.
The adaptation circuit is arranged to individually adapt the adaptive filters of the noise cancelling beamformers in different frequency intervals.
The beamform circuit of at least one of the plurality of noise cancelling beamformers is arranged to form a fixed beam.
The approach may provide improved performance and/or operation in many embodiments.
The approach may allow an efficient coverage of a desired region of interest in many embodiments.
The fixed beam may be a non-adaptive beam.
The fixed beam may have fixed directional properties that do not change during operation.
The at least one of the plurality of noise cancelling beamformers may comprise a non-adaptive beamform circuit.
The fixed beam may be formed in a predetermined direction.
A plurality of noise cancelling beamformers may form fixed beams, which may be overlapping beams.
The adaptation circuit is arranged to configure each of the noise cancelling beamformers to track a desired source or to track a noise source, and to exclude beamformed audio signals from noise cancelling beamformers tracking a noise source from the beamform operation.
This may provide improved performance and/or operation in many embodiments.
The audio source detector is arranged to determine a confidence measure for each beamformed audio signal, the confidence measure for a beamformed audio signal being indicative of a likelihood of the beamformed audio signal capturing an audio source; and the adaptive beamformer is arranged to exclude beamformed audio signals from noise cancelling beamformers for which the confidence measure does not meet a criterion.
This may provide improved performance and/or operation in many embodiments.
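A minimal sketch of the confidence-based exclusion is given below. The criterion (a fixed threshold of 0.5) and the combination (a plain average of the retained beams) are assumptions chosen for brevity; the text leaves both the criterion and the beamform operation open.

```python
def combine_confident_beams(beams, confidences, threshold=0.5):
    """Average only the beamformed signals whose confidence meets the threshold."""
    kept = [b for b, c in zip(beams, confidences) if c >= threshold]
    if not kept:
        # No beam qualifies: return silence of the same length.
        return [0.0] * len(beams[0])
    return [sum(samples) / len(kept) for samples in zip(*kept)]


beams = [[1.0, 2.0], [3.0, 4.0], [100.0, -100.0]]
out = combine_confident_beams(beams, confidences=[0.9, 0.7, 0.1])
```

The third beam has low confidence and does not contribute to `out`.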
The audio apparatus comprises an adaptive spatial decorrelation filter arranged to spatially decorrelate the plurality of audio signals prior to the beamform circuits generating the beamformed signals.
This may provide improved performance and/or operation in many embodiments.
each noise cancelling beamformer comprising: a beamform circuit generating a beamformed signal capturing audio in a beam and a noise reference signal capturing audio outside the beam, an adaptive filter filtering the noise reference signal to generate a filtered noise signal, and a compensation circuit generating the beamformed audio signal by removing the filtered noise signal from the beamformed signal; generating an output audio signal by applying a beamform operation to the beamformed audio signals from the plurality of noise cancelling beamformers; detecting a presence of a desired audio source in the plurality of audio signals; and adapting the adaptive beamformer and the adaptive filters, the adaptation of the adaptive beamformer being performed when the presence of the desired audio source is detected and the adaption of the adaptive filters not being performed when the presence of the desired audio source is detected.
FIG. 1 illustrates an example of an audio apparatus which is arranged to generate an output audio signal from a plurality of input signals with the processing being based on using multiple beamform operations including specifically using a hierarchical arrangement of beamform operations.
The audio apparatus may in many scenarios address or mitigate a number of the disadvantages of conventional systems, such as specifically a number of the issues of using a single beamformer to isolate and capture a desired audio signal, such as specifically a desired speaker.
The audio apparatus comprises a receiver 101 which is arranged to receive a plurality/first set of audio signals.
The receiver 101 is specifically arranged to receive audio signals from a microphone array 103 which comprises a plurality of microphones arranged to capture audio in the environment from different respective positions.
The receiver 101 receives audio signals from a set of microphones capturing the audio scene from different positions.
The microphones may for example be arranged in a linear array and relatively close to each other.
The maximum distance between capture points for the audio signals may in many embodiments not exceed 1 meter, 50 cm, 25 cm, or even in some cases 10 cm.
The receiver may comprise an optional echo canceller which may cancel the echoes that originate from acoustic sources (for which a reference signal is available) that are linearly related to the echoes in the microphone signal(s).
This source can for example be a loudspeaker.
An adaptive filter can be applied with the reference signal as input, and with the output being subtracted from the microphone signal to create an echo compensated signal. This can be repeated for each individual microphone. It will be appreciated that such an echo canceller is optional and may be omitted in many embodiments.
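The echo compensation described above can be sketched for a single microphone as follows. The normalized LMS (NLMS) update, the step size, the filter length, and the toy echo path are all assumptions made for illustration; the text does not specify which adaptive algorithm is used.

```python
import random


def nlms_echo_cancel(mic, ref, taps=8, mu=0.5, eps=1e-8):
    """Subtract an adaptively filtered reference signal from the microphone signal."""
    w = [0.0] * taps      # adaptive filter coefficients
    buf = [0.0] * taps    # most recent reference samples, newest first
    out = []
    for m, x in zip(mic, ref):
        buf = [x] + buf[:-1]
        echo_estimate = sum(wi * bi for wi, bi in zip(w, buf))
        e = m - echo_estimate                     # echo compensated sample
        norm = sum(b * b for b in buf) + eps
        w = [wi + mu * e * bi / norm for wi, bi in zip(w, buf)]
        out.append(e)
    return out, w


# Toy data: the microphone picks up only a filtered copy of the loudspeaker signal.
random.seed(0)
ref = [random.uniform(-1, 1) for _ in range(4000)]
echo_path = [0.0, 0.6, 0.0, -0.3]   # hypothetical loudspeaker-to-microphone response
mic = [sum(h * ref[n - k] for k, h in enumerate(echo_path) if n - k >= 0)
       for n in range(len(ref))]
out, w = nlms_echo_cancel(mic, ref)
residual_power = sum(e * e for e in out[-500:]) / 500
mic_power = sum(m * m for m in mic[-500:]) / 500
```

In a multi-microphone receiver the same function would be run once per microphone signal, each with its own coefficients.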
The audio capturing apparatus comprises a plurality of noise cancelling beamformers 105 each of which is arranged to generate a beamformed audio signal.
The noise cancelling beamformers 105 combine the signals from the receiver 101 such that an effective directional audio sensitivity of the microphone array 103 is generated.
Each of the noise cancelling beamformers 105 may thus be arranged to generate an output signal which corresponds to a selective capturing of audio in the environment.
FIG. 2 illustrates an example of elements of the noise cancelling beamformers 105.
Each of the noise cancelling beamformers 105 receives the audio signals from the receiver 101 and executes a beamform algorithm to generate a beamformed audio signal.
A noise cancelling beamformer 105 comprises a beamform circuit 201 which is arranged to generate a beamformed signal capturing audio in a beam.
The beamform circuit 201 further generates a noise reference signal which captures audio outside the beam.
The beamform circuit 201 may be arranged to generate the beamformed signal as a weighted combination of the input audio signals with each of these being individually weighted by a weight filter which may for example be a FIR filter.
The filter may be a flat filter and may be represented as a single coefficient, i.e. the filter may be an all pass filter in the form of a single weight/coefficient.
The beamform circuit 201 may be a filter-and-combine (or specifically in most embodiments filter-and-sum) beamformer.
A beamform filter may be applied to each of the microphone signals and the filtered outputs may be combined, typically by simply being added together.
Each of the beamform filters has a time domain impulse response which is not a simple Dirac pulse (corresponding to a simple delay and thus a gain and phase offset in the frequency domain) but rather has an impulse response which typically extends over a time interval of no less than 2, 5, 10 or even 30 msec.
The impulse response may often be implemented by the beamform filters being FIR (Finite Impulse Response) filters with a plurality of coefficients.
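A filter-and-sum operation of the kind described above can be sketched in a few lines. The helper names and the two-tap toy filters are illustrative assumptions; practical beamform filters would have many more coefficients.

```python
def fir(signal, coeffs):
    """Causal FIR filtering: y[n] = sum over k of coeffs[k] * signal[n - k]."""
    return [sum(c * signal[n - k] for k, c in enumerate(coeffs) if n - k >= 0)
            for n in range(len(signal))]


def filter_and_sum(mic_signals, filters):
    """Filter each microphone signal with its own FIR filter, then add the results."""
    filtered = [fir(s, f) for s, f in zip(mic_signals, filters)]
    return [sum(samples) for samples in zip(*filtered)]


# Toy example: a pulse reaches microphone 1 one sample after microphone 0.
mic0 = [0.0, 1.0, 0.0, 0.0, 0.0]
mic1 = [0.0, 0.0, 1.0, 0.0, 0.0]
# Delaying mic0 by one sample aligns the two channels before they are summed.
filters = [[0.0, 0.5], [0.5, 0.0]]
z = filter_and_sum([mic0, mic1], filters)
```

With these filters the two pulses add coherently at sample index 2.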
The beamforming may be adapted by adapting the filter coefficients.
The FIR filters may have coefficients corresponding to fixed time offsets (typically sample time offsets), with the adaptation being arranged to adapt the coefficient values.
Alternatively, the beamform filters may have substantially fewer coefficients (e.g. only two or three) but with the timing of these (also) being adaptable.
A particular advantage of the beamform filters having extended impulse responses rather than being a simple variable delay (or simple frequency domain gain/phase adjustment) is that it allows the noise cancelling beamformers 105 to not only adapt to the strongest, typically direct, signal component. Rather, it allows the noise cancelling beamformers 105 to be adapted to include further signal paths corresponding typically to reflections. Accordingly, the approach allows for improved performance in most real environments, and specifically allows improved performance in reflecting and/or reverberating environments and/or for audio sources further from the microphone array 103.
The noise reference signal may be generated using the same approach but with different filters/coefficients. Specifically, the noise reference signal may be generated by setting filters/coefficients to result in a null/minimum sensitivity being formed in the direction of the beam/maximum gain for generating the beamformed signal.
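For a two-microphone toy case, the relationship between the beamformed signal and a noise reference with a null in the beam direction might be illustrated as below. The integer-sample delay, the sum/difference structure, and all names are assumptions made to keep the sketch short.

```python
def delay(signal, d):
    """Delay a signal by d samples, padding with zeros."""
    return [0.0] * d + signal[:len(signal) - d]


def beam_and_null(mic0, mic1, d):
    """Sum output (maximum toward the target) and difference output (null toward it)."""
    aligned0 = delay(mic0, d)   # align mic0 with mic1 for the target direction
    beam = [0.5 * (a + b) for a, b in zip(aligned0, mic1)]
    noise_ref = [0.5 * (a - b) for a, b in zip(aligned0, mic1)]
    return beam, noise_ref


target0 = [0.0, 1.0, 0.0, 0.0, 0.0, 0.0]   # target reaches mic0 first ...
target1 = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]   # ... and mic1 one sample later
noise0 = [0.0, 0.0, 0.0, 0.0, 1.0, 0.0]    # interferer reaches both at the same time
noise1 = [0.0, 0.0, 0.0, 0.0, 1.0, 0.0]
mic0 = [t + n for t, n in zip(target0, noise0)]
mic1 = [t + n for t, n in zip(target1, noise1)]
beam, noise_ref = beam_and_null(mic0, mic1, d=1)
```

The target pulse appears at full amplitude in `beam` and is absent from `noise_ref`, whereas the interferer appears in both.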
An array of a plurality of microphones 103 is coupled to a beamform circuit 201 which generates an audio source signal z(n) and one or more noise reference signal(s) x(n).
US 7 146 012 and US 7 602 926 disclose examples of adaptive beamformers that focus on the speech but also provide a reference signal that often contains (almost) no speech.
The microphone array 103 may in some embodiments comprise only two microphones but will typically comprise a larger number.
The beamform circuit 201 of FIG. 2 specifically creates an enhanced output signal, z(n), by adding the desired part of the microphone signals coherently by filtering the received signals in forward matching filters and adding the filtered outputs. Also, the output signal is filtered in backward adaptive filters having conjugate filter responses to the forward filters (in the frequency domain, corresponding to time-reversed impulse responses in the time domain). Error signals are generated as the difference between the input signals and the outputs of the backward adaptive filters, and the coefficients of the filters are adapted to minimize the error signals, thereby resulting in the audio beam being steered towards the dominant signal.
The generated error signals x(n) can be considered as noise reference signals which are particularly suitable for performing additional noise reduction on the enhanced output signal z(n).
The primary signal z(n) and the reference signal x(n) are typically both contaminated by noise.
An adaptive filter can be used to reduce the coherent noise.
The noise reference signal x(n) is coupled to the input of an adaptive filter 203 which generates a filtered noise signal by filtering the noise reference signal.
The resulting filtered noise signal is then subtracted from the audio source signal z(n) in a subtractor 205 to generate a compensated signal r(n).
The beamformed audio signal output by the noise cancelling beamformer 105 is the compensated signal r(n), output by the subtractor 205 in FIG. 2, but it will be appreciated that in some embodiments further (post)processing may be included, such as e.g. a filtering.
The noise cancelling beamformers 105 are arranged to explicitly cancel noise. This noise cancelling is achieved by generating a filtered noise signal from the noise reference signal and then subtracting the filtered noise signal from the beamformed signal generated by the beamform circuit 201.
The adaptive filter 203 may be adapted to reduce/minimize the power of the compensated/beamformed audio signal r(n), typically when the desired audio source is not active (e.g. when there is no speech). Such an approach may result in a highly effective suppression of coherent noise resulting in an improved output signal and specifically with improved focus on the desired audio.
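The combination of adaptive filter 203 and subtractor 205, with adaptation frozen while the desired source is active, can be sketched as follows. The NLMS update, the filter length, the step size, and the synthetic signals are assumptions for illustration only; z, x and r follow the signal names used above.

```python
import random


def noise_cancel(z, x, desired_active, taps=4, mu=0.5, eps=1e-8):
    """r(n) = z(n) - (w * x)(n); w is updated only while the desired source is inactive."""
    w = [0.0] * taps
    buf = [0.0] * taps
    r = []
    for zn, xn, active in zip(z, x, desired_active):
        buf = [xn] + buf[:-1]
        rn = zn - sum(wi * bi for wi, bi in zip(w, buf))   # subtractor output
        if not active:
            # Minimize the power of r(n) with a normalized LMS step.
            norm = sum(b * b for b in buf) + eps
            w = [wi + mu * rn * bi / norm for wi, bi in zip(w, buf)]
        r.append(rn)
    return r, w


random.seed(1)
n = 3000
x = [random.uniform(-1, 1) for _ in range(n)]                       # noise reference
leak = [0.8 * x[i] + (0.2 * x[i - 1] if i else 0.0) for i in range(n)]
speech = [0.0] * 2000 + [random.uniform(-1, 1) for _ in range(1000)]
z = [s + l for s, l in zip(speech, leak)]                           # beamformed signal
active = [i >= 2000 for i in range(n)]                              # detector output
r, w = noise_cancel(z, x, active)
err = max(abs(a - b) for a, b in zip(r[2000:], speech[2000:]))
```

In this idealized case the filter converges during the noise-only interval, and the speech then passes through unchanged because the coefficients are frozen.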
The noise cancelling beamformers 105 may specifically employ the architecture of a Generalized Sidelobe Canceller (GSC). These beamformers typically have a target/unit response in their target direction and a lower response in any other direction, where the response gradually decreases for increasing angles from the target direction, with different slopes per frequency.
The response of a beamformer as a function of angle is also called a beampattern.
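To make the notion of a beampattern concrete, the sketch below evaluates the magnitude response of a delay-and-sum beamformer on a uniform linear array. The array geometry (four microphones, 5 cm spacing), the far-field plane-wave model, and the speed of sound are assumptions and are not taken from the text.

```python
import cmath
import math


def beampattern(n_mics, spacing_m, freq_hz, steer_deg, angle_deg, c=343.0):
    """Normalized magnitude response of a delay-and-sum beamformer on a linear array."""
    k = 2 * math.pi * freq_hz / c   # wavenumber
    phase = k * spacing_m * (math.sin(math.radians(angle_deg))
                             - math.sin(math.radians(steer_deg)))
    return abs(sum(cmath.exp(1j * m * phase) for m in range(n_mics))) / n_mics


on_target = beampattern(4, 0.05, 2000.0, steer_deg=0.0, angle_deg=0.0)
off_high = beampattern(4, 0.05, 2000.0, steer_deg=0.0, angle_deg=30.0)
off_low = beampattern(4, 0.05, 500.0, steer_deg=0.0, angle_deg=30.0)
```

The response is unity in the target direction and falls off faster with angle at 2 kHz than at 500 Hz, which is the frequency-dependent slope mentioned above.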
In the GSC structure, the beamform circuit 201 generates a beamformed signal z(n) and at least one noise reference x(n).
Noise reference signals are typically generated via a blocking matrix or filter.
The blocking matrix should filter out the signals coming from the target direction of the beamformer and let other signals pass through.
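One simple way to obtain a blocking matrix for a single frequency bin is to project onto the subspace orthogonal to the target steering vector. This construction is a common textbook choice and an assumption here; the text does not say how the blocking matrix is built, and the steering vectors below are hypothetical.

```python
def blocking_matrix(d):
    """B = I - d d^H / (d^H d): removes any component along the steering vector d."""
    n = len(d)
    norm = sum(abs(v) ** 2 for v in d)
    return [[(1.0 if i == j else 0.0) - d[i] * d[j].conjugate() / norm
             for j in range(n)] for i in range(n)]


def apply(matrix, vec):
    """Matrix-vector product for lists of complex numbers."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


target = [1 + 0j, 1j, -1 + 0j]            # hypothetical target steering vector
interferer = [1 + 0j, 1 + 0j, 1 + 0j]     # hypothetical interferer steering vector
B = blocking_matrix(target)
blocked = apply(B, target)      # target component is removed
passed = apply(B, interferer)   # interferer component passes through
```

The outputs of `apply(B, ...)` would serve as noise references; note that this projection yields as many outputs as microphones, of which one is redundant.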
An adaptive filter 203 that filters the noise reference signals completes the GSC structure together with a subtractor 205 removing the output signal from the adaptive filter 203 from the beamformed signal z(n).
the adaptive filter is updated in such a way that when its output is subtracted from the beamformed signal, leading to output beamformed audio signal r(n), coherent noises present in both the primary and noise reference outputs are cancelled.
Using delay-and-sum beamforming, reflections of desired speech will also appear in the noise references. For the control of the adaptive noise cancellation filters 103, it is therefore typically advantageous that no adaptation/updating is performed when desired speech is present. As a result, via the noise reference signals, the desired signal is injected into or added to the output of the noise-cancelling beamformers. The desired speech in the output signal r(n) is not cancelled; however, the desired speech in the output becomes more reverberated.
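The adaptive filter, subtractor, and gated adaptation described above can be sketched as follows. This is a minimal time-domain illustration that assumes a normalized LMS (NLMS) update, which is one common choice and is not stated in the text; the function name, the tap count, and the step size are hypothetical.

```python
import random


def gsc_noise_canceller(z, x, adapt_flags, taps=4, mu=0.5, eps=1e-8):
    """Subtract an adaptively filtered noise reference x(n) from z(n).

    adapt_flags[n] is True when the filter may be updated, e.g. when no
    desired speech is detected. Returns the compensated signal r(n).
    """
    w = [0.0] * taps
    history = [0.0] * taps  # most recent noise-reference samples, newest first
    r = []
    for zn, xn, adapt in zip(z, x, adapt_flags):
        history = [xn] + history[:-1]
        y = sum(wi * hi for wi, hi in zip(w, history))  # filtered noise signal
        rn = zn - y  # subtractor output
        if adapt:
            # NLMS step: move the coefficients so that the power of r(n) drops.
            norm = sum(h * h for h in history) + eps
            w = [wi + mu * rn * hi / norm for wi, hi in zip(w, history)]
        r.append(rn)
    return r


# Illustrative data: coherent noise leaks into the beamformed signal.
rng = random.Random(0)
x = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
z = [0.8 * x[n] + (0.3 * x[n - 1] if n > 0 else 0.0) for n in range(len(x))]

r = gsc_noise_canceller(z, x, [True] * len(x))          # adaptation allowed
r_frozen = gsc_noise_canceller(z, x, [False] * len(x))  # adaptation inhibited
```

With adaptation enabled, the residual in r decays as the filter converges. With adaptation inhibited from the start, the coefficients stay at zero and the output equals z(n).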
the noise cancelling beamformers 105 are however not merely used to select, isolate and extract individual desired audio sources in the audio scene which are then used for the given application, such as e.g. automatic speech recognition. Rather, the audio apparatus further comprises an adaptive beamformer which receives the beamformed audio signals from the noise cancelling beamformers 105 and which proceeds to generate an output audio signal by applying a beamform operation to the beamformed audio signals from the noise cancelling beamformers 105.
the audio apparatus employs a hierarchical arrangement of beamform operations with the beamformed audio signals from a first layer of noise cancelling beamformers 105 being used as input signals for a second layer adaptive beamformer which performs an additional beamform operation on already beamformed (and noise cancelled) audio signals capturing audio in an audio scene.
the adaptive beamformer 107 may use the same beamforming operation/approach as the beamform circuit 201 to generate the beamformed audio signal.
the beamforming operation may accordingly generate the output audio signal from a combination of the beamformed audio signals, with the combination including an individual weight/scaling/filtering of each beamformed audio signal.
each beamformed audio signal may (in the frequency domain) be weighted (typically by a complex weight) before e.g. being summed (with typically this operation being performed individually in each frequency bin).
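As a minimal sketch of this per-bin weighted sum, assume each beamformed audio signal is available as a list of complex frequency bin values for one time segment. The function name and the example values are hypothetical.

```python
def combine_beams(beam_spectra, weights):
    """Weighted sum of beamformed signals, evaluated separately per bin.

    beam_spectra[i][k] is the complex value of beam i in frequency bin k,
    and weights[i][k] is the complex weight applied to it.
    """
    num_bins = len(beam_spectra[0])
    return [
        sum(w[k] * s[k] for w, s in zip(weights, beam_spectra))
        for k in range(num_bins)
    ]


# Two beams, two frequency bins.
beam_spectra = [[1 + 1j, 2 + 0j], [1 - 1j, 2j]]
weights = [[0.5 + 0j, 1 + 0j], [0.5 + 0j, -1j]]
output = combine_beams(beam_spectra, weights)
```

In the second bin the weight -1j rotates the second beam so that it adds in phase with the first, which is the kind of alignment a complex weight allows.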
the audio apparatus further comprises an adaptation circuit 109 which is arranged to adapt the adaptive filters 203 of the noise cancelling beamformers 105 as well as the beamform operation of the adaptive beamformer 107.
FIG. 1 illustrates an example wherein the adaptation circuit 109 is for clarity shown to include a first adaptation controller 111 which is arranged to adapt the adaptive filters 203 and a second adaptation controller 113 which is arranged to adapt the adaptive beamformer 107.
the adaptations may be dependent on each other, share some functionality, be closely integrated, and performed by the same function/circuit/processor etc.
the adaptation circuit 109/ first adaptation controller 111 may be arranged to adapt the coefficients of the adaptive filters 203 and may adapt the weights/filter coefficients of the filters applied to the beamformed audio signals before these are added together.
the adaptation circuit 109/ first adaptation controller 111 may be arranged to adapt/change/modify the adaptive filters 203 to reduce, and preferably minimize, a signal level of the beamformed audio signal of the corresponding noise cancelling beamformer 105.
the adaptation circuit 109/ second adaptation controller 113 is arranged to adapt/change/modify the adaptive beamformer 107 to increase, and typically maximize (under a constraint on the coefficients), the signal level of the output audio signal.
the signal level measured/considered by the adaptation circuit 109 may be an amplitude, power, and/or energy measure for the corresponding audio signal.
the adaptation circuit 109 is arranged to adapt the coefficients of the adaptive filters 203 to minimize the power/energy of the beamformed audio signals and the filters/weights of the adaptive beamformer 107 to maximize the power/energy of the output audio signal.
the adaptation circuit 109 is further arranged to control/select when the adaptive filters 203 and the adaptive beamformer 107 are adapted and updated.
the adaptation circuit 109 is specifically arranged such that at any given time at most one of the adaptive filters 203 and the adaptive beamformer 107 is adapted or updated; the two are never adapted at the same time. This ensures that the synergy between the operations and the adaptations thereof is exploited to generate an output audio signal with improved properties.
the audio apparatus of FIG. 1 comprises an audio source detector 115 which is arranged to detect a presence of a (desired) audio source in the audio environment/audio signals and the adaptation circuit 109 is arranged to determine when to adapt the adaptive beamformer 107 dependent on whether a desired audio source is detected or not.
the audio source detector 115 may in many embodiments be arranged to determine an estimate of a property/parameter, such as a signal level/amplitude/power/energy estimate/measure, of at least one of the input audio signals, the beamformed audio signals, and/or the output audio signal. If the determined estimate meets a criterion, the audio source detector 115 designates that a desired audio source is present and otherwise it designates that a desired audio source is not present. The audio source detector 115 may for example generate a binary control signal indicating whether a desired audio source is considered to be present or not.
the audio source detector 115 may determine if the adaptive beamformer 107 is updated/adapted.
the adaptive beamformer 107 could potentially be continuously adapted leading to always trying to find a coherent source. However, to obtain a more stable signal, the adaptation is controlled and not performed continuously. This is especially helpful in avoiding the adaptive beamformer 107 from diverging in case of e.g. (short) speech pauses.
the operation of detecting a presence of a desired audio source is by itself a standard operation in the field and for beamformers, and many suitable algorithms, techniques, and criteria are known to the skilled person. It will also be appreciated that it is generally not required that the audio source detector 115 identifies a specific source (or identity thereof) or that the detection is particularly accurate or reliable. Rather, in many embodiments, a relatively rough detection that there is a likelihood that an audio source is present may be sufficient. For example, in many embodiments, a simple detection that the output audio signal (or e.g. one of the beamformed audio signals) has an amplitude above a given threshold may be sufficient.
the audio source detector 115 could be based on measuring variation in the power of one or all of the microphone signals (and/or possibly in one or all of the noise-cancelling beamformer outputs). When the variation in the power rises above a threshold, the adaptive beamformer 107 may be allowed to update.
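A power-variation detector of this kind might look like the following sketch. The text does not specify how the variation is measured, so this example assumes a relative deviation of the frame power from a recursively smoothed baseline; the function name, threshold, and smoothing factor are hypothetical.

```python
def detect_desired_source(frame_powers, threshold=1.0, alpha=0.9, eps=1e-12):
    """Flag frames whose power deviates strongly from a smoothed baseline."""
    flags = []
    baseline = frame_powers[0]
    for p in frame_powers:
        variation = abs(p - baseline) / (baseline + eps)
        detected = variation > threshold
        flags.append(detected)
        if not detected:
            # Track the background level only while no source is flagged.
            baseline = alpha * baseline + (1.0 - alpha) * p
    return flags


# Illustrative frame powers: a burst of activity in frames 4 to 6.
powers = [1.0, 1.1, 0.9, 1.0, 6.0, 7.5, 5.0, 1.0, 0.95]
flags = detect_desired_source(powers)
```

The resulting binary flags could serve as the control signal that allows or inhibits adaptation of the adaptive beamformer 107.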
suitable approaches for detecting a desired audio source include measuring variation in the power of one or all of the noise-cancelling beamformer outputs or a neural network specifically trained to detect speech presence. Such a network could also be using an enrollment to target a specific speaker among multiple speakers.
the audio source detector 115 is coupled to the adaptation circuit 109 (and specifically the first and second adaptation controllers 111, 113) which is arranged to control when the adaptive filters 203 and adaptive beamformer 107 are adapted based on whether a desired audio source is detected to be present or not.
the second adaptation controller 113 is arranged to control the adaptation of the adaptive beamformer 107 such that this is performed when the presence of the desired audio source is detected. Further, it may control the adaptation such that the adaptive beamformer 107 is not adapted when the desired audio source is not detected to be present.
the adaptation circuit 109 controls the adaptation of the adaptive beamformer 107 such that it will adapt only when the desired audio source is present, and it may further adapt to increase/maximize the signal level of the generated output audio signal.
the adaptive beamformer 107 may be adapted towards maximizing the signal level when the desired audio signal is present.
the first adaptation controller 111 is arranged to adapt the adaptive filters 203 such that no adaptation is performed when the desired audio source is detected. Accordingly, when the desired audio is present, the adaptive beamformer 107 is adapted but the adaptive filters 203 are not adapted. This may ensure that the adaptive filters 203 are not adapted to the desired audio source but rather to other components in the audio environment, and thus specifically may adapt when a noise source is present/active.
the first adaptation controller 111 may be arranged to adapt the adaptive filters 203 whenever the audio source detector 115 indicates that the desired audio source is not present/on. This may for example allow highly efficient adaptation and operation in scenarios where there is a constant noise source (e.g. a fan) which however is less audible than the desired audio source.
the adaptation of the adaptive filters 203 may further be subject to a consideration of whether a noise source is present or not.
the audio apparatus of FIG. 1 further comprises a noise source detector 117 which is arranged to detect a presence of a noise source in the audio environment/audio signals and the adaptation circuit 109 is arranged to determine when to adapt the adaptive filters 203 dependent on whether a noise source is detected or not.
the noise source detector 117 may in many embodiments be arranged to determine an estimate of a property/parameter, such as a signal level/amplitude/power/energy estimate/measure, of one or more of the noise reference signals generated by the beamform circuits 201. If the determined estimate(s) meet(s) a criterion, the noise source detector 117 designates that a noise source is present and otherwise it designates that a noise source is not present. The noise source detector 117 may for example generate a binary control signal indicating whether a noise source is considered to be present or not.
the operation of detecting a presence of a noise source is by itself a standard operation in the field and for beamformers, and many suitable algorithms, techniques, and criteria are known to the skilled person. It will also be appreciated that it is generally not required that the noise source detector 117 identifies a specific source (or identity thereof) or that the detection is particularly accurate or reliable. Rather, in many embodiments, a relatively rough detection that a potentially suitable noise source is present is sufficient. For example, in many embodiments, a simple detection that the noise reference signal of one of the beamform circuits 201 has an amplitude above a given threshold may be sufficient.
the audio apparatus may include a noise tracking beamformer (which e.g. may be done allocating one of the noise cancelling beamformers 105) which specifically tracks an audio source that is a noise source in the audio environment.
the noise tracking beamformer may in such a case simply detect the presence of the noise source dependent on the level of the beamformed audio signal generated by the noise tracking beamformer. For example, if the signal level exceeds a threshold, the noise source is designated to be present/on and otherwise it is designated as not being present.
the noise source detector 117 may determine/estimate whether a point noise interferer is active or not.
the specific algorithm and operation of the noise source detector 117 may strongly depend on the type of interference that is expected/assumed/to be detected.
a threshold applied to a smoothed power estimate can e.g.
a neural network trained with speech and several (different) types of non-speech, such as music, can be used.
the neural network provides for each frame an indication whether the frame contains speech or noise.
the adaptation of the adaptive filters 203 is dependent on whether a noise source is detected or not.
the adaptation circuit 109/ first adaptation controller 111 may be arranged to only adapt the adaptive filters 203 when a noise source is detected to be present.
the adaptation circuit 109 may be arranged to adapt/update the adaptive beamformer 107 but not the adaptive filters 203 when a desired audio source is detected, and to adapt/update the adaptive filters 203 but not the adaptive beamformer 107 when no desired audio source is detected but a noise source is detected to be present. If neither a desired audio source nor a noise source is detected, neither the adaptive beamformer 107 nor the adaptive filters 203 are updated.
the adaptation circuit 109 may accordingly be arranged to adapt the adaptive beamformer 107 in response to the detection of the presence of the desired audio source and to inhibit/prevent adaptation of the adaptive filters 203 in response to the detection of the presence of the audio source.
the adaptation circuit 109 may be arranged to adapt the adaptive filters 203 in response to the detection of the presence of the noise source and when no desired audio source is detected and to inhibit/prevent adaptation of the adaptive filters 203 when the noise source is not detected.
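The resulting control rules can be summarized in a short sketch. It assumes the two detectors each provide a binary flag, as suggested in the text; the function name and return convention are hypothetical.

```python
def select_adaptation(desired_detected, noise_detected):
    """Return (adapt_beamformer, adapt_filters); the two are never both True."""
    if desired_detected:
        # Desired source present: adapt the adaptive beamformer only.
        return True, False
    if noise_detected:
        # Noise source present without a desired source: adapt the filters only.
        return False, True
    # Neither detected: keep all coefficients frozen.
    return False, False


decisions = {
    (desired, noise): select_adaptation(desired, noise)
    for desired in (False, True)
    for noise in (False, True)
}
```

Checking the desired source first means that a frame in which both detectors fire adapts only the adaptive beamformer, which matches the rule that the adaptive filters are not adapted while the desired audio source is present.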
the processing is typically performed in the frequency domain with individual processing in different frequency intervals/bins.
the signals and the operation are divided into different frequency intervals/bins with the processing in each frequency interval being individually performed and adapted in each frequency interval.
the beamform operation of the adaptive beamformer 107 uses different/individual combination parameters (weights/beamform filters) for the different frequency intervals and the beamform operation may comprise combining the beamformed audio signals using different combination parameters in different frequency intervals, and the adaptation circuit 109 is arranged to individually adapt the combination parameters in different frequency intervals.
the adaptation circuit 109 may adapt the adaptive beamformer 107 such that it applies a different combination and beamforming in different frequency intervals.
the adaptation circuit is arranged to individually adapt the adaptive filters (203) of the noise cancelling beamformers (105) in different frequency intervals.
the adaptive filters 203 may for example be arranged to apply different weights/coefficients (typically complex) in different frequency intervals/bins with these being adapted individually in each frequency interval.
the adaptation circuit 109 may be arranged to adapt the beamforming with this being adapted individually in each frequency interval/bin. Thus, a different beamforming may be performed in each frequency interval and thus different beams may be formed for different frequency intervals.
the receiver 101 may comprise a segmenter which is arranged to segment the input audio signals into time segments.
the segmentation may typically be a fixed segmentation into time segments of a fixed and equal duration, e.g. a division into time segments/intervals with a fixed duration of 10 to 20 ms.
the segmentation may be adaptive for the segments to have a varying duration.
the input audio signals may have a varying sample rate and the segments may be determined to comprise a fixed number of samples.
the segmentation may typically be into segments with a given fixed number of time domain samples of the input signals.
the segmenter 101 may be arranged to divide the input signals into consecutive segments of e.g., 256 or 512 samples.
the receiver 101 may be arranged to generate a frequency bin representation of the input audio signals and the input signals are subsequently typically represented in the frequency domain by a frequency bin representation.
the audio apparatus may be arranged to perform frequency domain processing of the frequency domain representation of the input audio signals.
the signal representation and processing are based on frequency bins and thus the signals are represented by values of frequency bins and these values are processed to generate frequency bin values of the output signals.
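The segmentation and conversion to frequency bins can be sketched as follows, assuming fixed segments of 256 samples and a plain Discrete Fourier Transform with no windowing or overlap. Those choices, like the function names, are assumptions for illustration. A practical implementation would normally use an FFT routine rather than this direct O(N²) form.

```python
import cmath
import math


def segment(samples, segment_length=256):
    """Split into consecutive full segments; a trailing partial segment is dropped."""
    count = len(samples) // segment_length
    return [
        samples[i * segment_length:(i + 1) * segment_length]
        for i in range(count)
    ]


def dft(segment_samples):
    """Direct DFT of one segment, returning one complex value per frequency bin."""
    n = len(segment_samples)
    return [
        sum(s * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, s in enumerate(segment_samples))
        for k in range(n)
    ]


# Illustrative input: a cosine with exactly 8 periods per segment.
SEGMENT_LENGTH = 256
signal = [
    math.cos(2.0 * math.pi * 8 * n / SEGMENT_LENGTH)
    for n in range(2 * SEGMENT_LENGTH)
]
segments = segment(signal, SEGMENT_LENGTH)
bins = dft(segments[0])
peak_bin = max(range(SEGMENT_LENGTH // 2), key=lambda k: abs(bins[k]))
```

Each element of bins is the complex value of one frequency bin, which is the representation on which the per-bin weights and adaptive filter coefficients operate.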
the frequency bins have the same size, and thus cover frequency intervals of the same size. However, in other embodiments, frequency bins may have different bandwidths, and for example a perceptually weighted bin frequency interval may be used.
the input audio signals may already be provided in a frequency representation and no further processing or operation is required. In some such cases, however, a rearrangement into suitable segment representations may be desired, including e.g. using interpolation between frequency values to align the frequency representation to the time segments.
suitable transforms for generating the frequency bin representation include a filter bank such as a Quadrature Mirror Filter (QMF), a Discrete Fourier Transform (DFT), or a Fast Fourier Transform (FFT).
an adaptive beamformer 107 may combine the outputs of multiple noise-cancelling beamformers 105.
the noise-cancelling beamformers 105 may be arranged to cancel a strong point noise interference and the adaptive beamformer 107 may be adapted to focus on a desired audio source which specifically may be desired speech.
the adaptation circuit 109 controls the adaptation of the adaptive filters 203 to cancel a strong interference and the adaptive beamformer 107 to concentrate desired audio/speech energy by adapting the adaptive filters 203 to minimize the signal energy of the beamformed audio signals and adapting the adaptive beamformer 107 to maximize the signal level of the output audio signal with the adaptations being restricted to situations when a strong noise source is detected without a desired audio source being present and situations when a desired audio source is detected to be present respectively.
the noise cancelling beamformers 105 may be arranged to cover a specific spatial direction or location and may or may not be adaptive beamform circuits 201.
Each beamform circuit 201 may generate a primary output signal (a beamformed signal) and at least one noise reference signal.
the primary output signal contains audio from the target direction or location and the noise reference signal(s) may contain audio from other directions.
An adaptive filter is applied to the at least one noise reference signal and its output is subtracted from the primary output signal of the beamform circuit 201 which results in a cancelling of coherent noise sources.
the beamformed audio signals of the noise cancelling beamformers 105 may be combined by the adaptive beamformer 107 which collects/combines the desired audio/speech energy from these.
the adaptive beamformer 107 may specifically maximize the desired speech energy in its output by applying a time-frequency dependent weighting of the beamformed audio signals.
the approach may be particularly suitable for many speech enhancement systems and applications.
there may in addition to diffuse noise in the audio environment also be a correlated point noise interferer at a relatively loud level.
These may specifically be audio sources (such as radios and TVs) for which no audio reference is available for echo cancellation.
the audio sources may be close to the microphones and the audio capturing device.
SNR Signal to Noise Ratio
the described approach may allow a very efficient cancellation of noise in such scenarios. For example, a significant amount of noise may be removed by the noise cancellation being performed by the noise cancelling beamformers 105. However, such processing may also result in increased reverberation and in particular if multiple captures and beamform signals are considered.
the described approach, and in particular the frequency selective combination performed by the adaptive beamformer 107 as described may allow an efficient noise cancellation while reducing the amount of reverberation in the output signal. Indeed, in many embodiments, a result of the described approach may be that speech in the output audio signal of the adaptive beamformer 107 is dereverberated compared to the speech signals of the beamformed signals input to the adaptive beamformer 107.
the beamform circuits 201 of the noise cancelling beamformers 105 may in some embodiments be adaptive beamformers that may e.g. each seek to capture and track an individual noise source.
the adaptation of a given beamform circuit 201 of a given noise cancelling beamformer 105 may be such that it adapts the beamform parameters (weights, filters applied to the input signals) to seek to maximize the signal level of the generated beamform signal z(n) of the beamform circuit 201.
the adaptation may typically seek to find a local maximum (rather than a global maximum).
different beamform circuits 201 may be assigned to track a different audio source, such as a specific speaker, and by gradually updating/adapting the beamform parameters to maximize the signal level of the generated beamformed signal, the individual beamform circuit 201 will continue to track that audio source/speaker even if more dominant audio sources are present in the audio environment.
the adaptation of the main beam/beamform signal z(n) may also automatically update a beamforming generating the noise reference signal x(n) for example by automatically forming a notch in the direction of the main beam for the beamformed signal.
the different beams may overlap and multiple beamformed audio signals may include reflection components from the same speaker/audio source.
the described hierarchical approach allows a particularly advantageous output audio signal to be generated with reduced reverberation and reflection components from the different beamformed audio signals.
the system may include (at least) three different adaptations to detect and track audio sources, including a plurality of initial adaptive beamform operations, an adaptive noise cancelling, and an additional output adaptive beamformer 107 which combines the resulting signals from the previous operations to generate an output audio signal.
one, more, or all of the beamform circuits 201 may be non-adaptive and form a fixed beam.
the beamform circuits 201 of the noise-cancelling beamformers 105 may each be arranged to fixedly cover a certain part of a region of interest (ROI).
the plurality of noise cancelling beamformers 105 may be arranged with fixed beams such that they together cover the entire ROI.
the beamformers may or may not be overlapping, which can differ per frequency.
delay-and-sum beamformers can be designed that primarily focus on the directions with an azimuth of e.g. 0, 90, 180 and 270 degrees. These beamformers typically have a target/unit response in their target direction and a lower response in any other direction, where the response gradually decreases for increasing angles from the target direction, with different slopes per frequency.
delay-and-sum beamformers can be designed for example such that their primary focus is on the directions with an azimuth of e.g. 0, 45, 90, 135 and 180 degrees, where 0 degrees is along the same axis as the placement of the microphones.
a signal from 90 degrees is perceived the same as a signal from 270 degrees and 45 degrees corresponds to an angle of 315 degrees.
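This front/back ambiguity of a linear array can be checked numerically with a small sketch of a delay-and-sum response. It assumes far-field plane waves, a uniform linear array with 5 cm spacing, a speed of sound of 343 m/s, and a frequency of 2 kHz; all of these values and the function name are illustrative assumptions.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # metres per second (assumed)


def delay_and_sum_response(angle_deg, target_deg, mic_positions, frequency_hz):
    """Magnitude response of a delay-and-sum beamformer on a linear array.

    Angles are azimuths in degrees, with 0 degrees along the microphone axis.
    mic_positions are positions along that axis in metres.
    """
    k = 2.0 * math.pi * frequency_hz / SPEED_OF_SOUND
    # Only the cosine of the angle enters, so mirrored angles coincide.
    diff = math.cos(math.radians(angle_deg)) - math.cos(math.radians(target_deg))
    total = sum(cmath.exp(1j * k * p * diff) for p in mic_positions)
    return abs(total) / len(mic_positions)


mics = [0.0, 0.05, 0.10, 0.15]
pattern = {
    angle: delay_and_sum_response(angle, 90.0, mics, 2000.0)
    for angle in (0, 45, 90, 270, 315)
}
```

For a beam steered to 90 degrees, the response is unity at both 90 and 270 degrees, and the values at 45 and 315 degrees coincide, because the inter-microphone delays depend only on the cosine of the azimuth.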
the response of a beamformer as function of an angle is also called a beampattern.
Example beampatterns for four beamformers that in total cover 360 degrees are depicted in FIG. 3.
In the GSC structure, the beamformer generates a primary signal z(n) and at least one noise reference x(n).
Noise reference signals are typically generated via a blocking matrix or filter.
the blocking matrix should filter out the signals coming from the target direction of the beamformer and let other signals pass through.
An adaptive filter that filters the noise reference signals completes the GSC structure. The adaptive filter is updated in such a way that when its output is subtracted from the primary output of the beamformer, leading to output signal r(n), coherent noises present in both the primary and noise reference outputs are cancelled. Via delay-and-sum beamforming, reflections of desired speech will also appear in the noise references.
the filters are not updated when desired speech is present.
the desired signal is injected into or added to the output of the noise-cancelling beamformers.
the desired speech in the output signal r(n) is not cancelled; however, the desired speech in the output becomes more reverberated.
the adaptive beamformer 107 consists of a beamformer such as is described in US 7 146 012 or US 7 602 926 .
the inputs to this adaptive beamformer are the outputs of the at least two noise-cancelling beamformers.
When the beamformer is allowed to update, it collects speech energy from its inputs by applying frequency dependent weights to its input.
the beamformer optimally weighs the different inputs to maximize its output power.
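One way to picture output-power maximization under a constraint on the coefficients is a normalized gradient-ascent step per frequency bin. The sketch below is only an illustration under a unit-norm constraint. It is not the algorithm of the cited US patents, and the function name, step size, and test signal are hypothetical.

```python
import cmath
import math


def update_weights(weights, inputs, mu=0.05):
    """One constrained gradient-ascent step for a single frequency bin.

    weights and inputs hold one complex value per beamformed input signal.
    The step increases |y|^2 for y = sum(w * x), then rescales the weights
    to unit norm so that the output power cannot grow without bound.
    """
    y = sum(w * x for w, x in zip(weights, inputs))
    stepped = [w + mu * y * x.conjugate() for w, x in zip(weights, inputs)]
    norm = math.sqrt(sum(abs(w) ** 2 for w in stepped))
    return [w / norm for w in stepped]


# Illustrative bin: the desired source reaches the two beams with gains h.
h = [1.0 + 0.0j, 0.5j]
weights = [1.0 / math.sqrt(2.0)] * 2
for n in range(500):
    s = cmath.exp(1j * 0.3 * n)  # unit-magnitude source value in this bin
    weights = update_weights(weights, [s * hi for hi in h])

gain = abs(sum(w * hi for w, hi in zip(weights, h)))
weight_norm = math.sqrt(sum(abs(w) ** 2 for w in weights))
```

In this toy case the weights converge towards the normalized conjugate of h, so that the contributions of both beams add in phase and the gain approaches the norm of h.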
the hierarchical beamforming approach is particularly advantageous for such a fixed beam approach as it effectively and advantageously in many scenarios can handle and combine signals from different beams to allow an improved capture of individual sources as they move in the environment. It may in particular provide reduced sensitivity to the movement and where the audio source is located with respect to the formed beams.
the target audio source may in a fixed beam arrangement be substantially offset with respect thereto, and the described approach may improve capture in such scenarios.
the adaptive beamformer 107 may adapt such that the contributions to the output audio signal from the different beams are optimally distributed over the overlapping beams that capture the desired source.
when the target audio source is sufficiently central within one beam, the weighting of this beam/beamformed audio signal may be substantially increased and indeed the adaptive beamformer 107 may fully focus on that beam.
the adaptation is performed separately and individually in each frequency interval.
FIG. 4 illustrates an audio apparatus corresponding to that of FIG. 1 but with a number of additional optional functions and elements.
the audio apparatus may in particular comprise the following:
the adaptation circuit 109 may control the adaptation and operation of the different beamformers and circuits. For example, it may proceed to determine in which beam a (far distant) speaker is present and which beam has the strongest speech (e.g. based on the output of the free running beamformer). It may then check if there is a focused beam (a beam of a noise cancelling beamformer 105) close to the free running beam (the beam of the free running beamformer) and create a new candidate speech beam if not (using one of the noise cancelling beamformers 105). It may then proceed to evaluate whether this candidate beam contains speech or noise, and it may create a new noise beam if necessary (e.g. using a noise cancelling beamformer 105).
the audio apparatus may comprise a number of beamformers/noise cancelling beamformers 105 which can be allocated to function as noise beamformers or as noise cancelling beamformers that extract a desired audio source/speech.
the audio apparatus may further comprise an output circuit which may in some embodiments and scenarios be arranged to postprocess the output audio signal generated by the adaptive beamformer 107.
the output circuit is specifically an output selector 407 which is arranged to select between the output audio signal generated by the adaptive beamformer 107 and one (or more) of the beamformed audio signals generated by the noise cancelling beamformers 105.
the audio apparatus may include a speech recognition circuit 409 arranged to perform automatic speech recognition on the output audio signal.
the audio apparatus may alternatively or additionally include a communicator 411 which is arranged to generate an output data signal including the generated output audio signal and transmit the output data signal to a remote entity.
the audio apparatus may include functionality for performing automatic speech recognition and also functionality for transmitting speech/audio to a remote destination, e.g. as part of a teleconferencing application.
a smartphone, personal computer, or communication device may comprise a number of different functions that may use the described audio capturing approach for a range of different applications.
the selector 407 may be arranged to select between the output audio signal from the adaptive beamformer 107 and a beamformed audio signal of one of the noise cancelling beamformers 105 depending on whether the signal is for automatic speech recognition or for communication.
the selector 407 may thus be arranged to switch the output audio signal between being generated by applying a beamform operation to the beamformed audio signals and the output audio signal being generated as a beamformed audio signal of one of the plurality of noise cancelling beamformers in response to a selection between the communication circuit and the speech recognition circuit as being active/being the application/circuit receiving and using the output signal.
the selector 407 may be arranged to select the output signal to be the output audio signal from the adaptive beamformer 107 when the automatic speech recognition circuit 409 is receiving the output signal and to select the output from one of the noise cancelling beamformers 105 (specifically the one tracking a desired speaker) when the output signal is fed to/used by the communication circuit 411.
Such an approach may provide advantageous operation optimized for the specific application in many embodiments. It may allow the system to benefit from having the same processing chain and control logic for both communication and automatic speech recognition applications, leading to an efficient solution for two use cases in parallel. It also enables applications such as transcription, diarization and automatically taking minutes in a communication scenario such as an online meeting.
the adaptive beamformer 107 may at least partially degrade the performance of a correctly focused controlled beamformer since the dereverberated speech therefrom is mixed with reverberant speech in the signals of other noise cancelling beamformers 105. Due to the quality of modern automatic speech recognition engines, this may still provide acceptable performance. However, for e.g. communication use cases, it may degrade the quality of the output signal.
the described selection approach may accordingly provide improved performance in many embodiments.
The operation of the selector 407 is typically application specific. For communication use cases, all possible controlled beam and free-running beam outputs can e.g. be presented to the user/communication circuit, and a selection can be made based on a criterion defined by the user, e.g. the strongest speaker can be selected. For automatic speech recognition applications, the output of the adaptive beamformer 107 may be chosen when a point noise interferer is detected; otherwise, a selection can be made based on a criterion defined by the user, e.g. based on the strongest speaker.
For an automatic speech recognition application, the output selector 407 may be arranged to select one of the free running or noise cancelling beamformer outputs when no point interferer is detected. When a point interferer is detected, the output of the adaptive beamformer 107 is selected instead. Alternatively or additionally, a strategy may be in place to provide an output for the communication use case.
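As an illustration only, the selection rule described above may be sketched as follows. The names `Consumer`, `BeamOutput`, `select_output` and the per-frame `power` estimate are hypothetical and do not appear in the described apparatus; the sketch assumes that "strongest speaker" is the user-defined criterion and that a point interferer flag is provided by some detector.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Sequence


class Consumer(Enum):
    ASR = "asr"                      # stands in for the speech recognition circuit 409
    COMMUNICATION = "communication"  # stands in for the communication circuit 411


@dataclass
class BeamOutput:
    name: str     # hypothetical label, e.g. "controlled_0" or "free_running"
    power: float  # hypothetical per-frame speech power estimate


def select_output(consumer: Consumer,
                  beams: Sequence[BeamOutput],
                  point_interferer_detected: bool) -> str:
    """Return the label of the signal forwarded by the selector for one frame."""
    # For ASR, the combined (adaptive beamformer) output is used when a
    # point noise interferer is detected.
    if consumer is Consumer.ASR and point_interferer_detected:
        return "adaptive_beamformer"
    # Otherwise fall back to a user-defined criterion; assumed here to be
    # the strongest speaker among the controlled and free running beams.
    return max(beams, key=lambda b: b.power).name
```

In this sketch the communication use case never receives the adaptive beamformer output, which reflects the observation above that the combined signal may degrade quality for communication.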
An issue with the system of FIG. 4 is that in use cases with strong (continuous) point noise interferers, it may at low SNR levels become more and more difficult to focus the free running beam (the beam of the free running beamformer 403) and the controlled beams (the beams of the noise cancelling beamformers 105) accurately on desired speakers, although the free running beam is often still able to detect a new speaker.
Improved noise reduction by extending the length of the adaptive spatial decorrelator 401 may improve the performance in case of a strong interferer, but it has drawbacks in that latency is increased and convergence of the adaptive spatial decorrelator 401 becomes slower. Both these drawbacks degrade performance for the common use cases with low, moderate, or even absent point noise interferers. These circumstances are typical for communication, and in communication scenarios limited latency is essential.
When noise cancelling beamformers 105 are not focused on speakers, desired speech in the beamformer outputs becomes more and more diffused/reverberated and noisy. Also, modern automatic speech recognition engines are increasingly less sensitive to reverberant speech and (continuous) noise, as long as the desired speech is not distorted, e.g., by switching/selection artifacts or by a post-processor that performs (non-linear) noise suppression, and the SNR is sufficiently positive.
The SNR and reverberation may typically be brought to acceptable levels for an automatic speech recognition engine to operate efficiently even in cases where (strong) point noise interferers are present.
The noise cancelling beamformers 105 of FIG. 2 comprise an (optional multi-channel) noise-cancelling adaptive filter that can be used to remove coherent noise from the beamformer output.
The adaptive filter further removes the coherent noise that originates from the interferer.
The additional reduction of coherent noise with the adaptive filter can be achieved by updating the adaptive filter parameters when the adaptation circuit 109 does not detect speech in the corresponding noise cancelling beamformer 105.
Each 'unfocused' (not currently tracking a strong audio source) noise cancelling beamformer output contains diffused or reverberated desired speech.
The reverberation originates in the first place from a mismatched adaptive beamformer, which therefore does not perform the intended/desired dereverberation.
Desired speech that leaks into the noise references (due to the mismatched adaptive beamformer) is injected into the beamformer output via the adaptive filter.
The noise-cancelling adaptive filter is not allowed to update when speech is detected; consequently, the desired speech is not actively cancelled by the adaptive filter.
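The gated update described above may be illustrated with a minimal single-channel sketch. It assumes a normalized LMS (NLMS) update, which is one common choice and is not stated to be the update rule of the described apparatus; the function name, the tap count, the step size and the per-sample `speech_detected` flag are all hypothetical (the apparatus is described as making decisions per frame).

```python
import numpy as np


def nlms_noise_canceller(primary, noise_ref, speech_detected,
                         taps=8, mu=0.5, eps=1e-8):
    """Subtract a filtered noise reference from the primary beamformer output.

    The filter coefficients are only adapted while no speech is detected, so
    that desired speech is not actively cancelled.
    """
    w = np.zeros(taps)    # adaptive filter coefficients
    buf = np.zeros(taps)  # most recent noise reference samples
    out = np.empty(len(primary))
    for n in range(len(primary)):
        buf = np.roll(buf, 1)
        buf[0] = noise_ref[n]
        e = primary[n] - w @ buf  # output after noise cancellation
        out[n] = e
        if not speech_detected[n]:
            # NLMS update (assumed), frozen whenever speech is present
            w = w + mu * e * buf / (buf @ buf + eps)
    return out, w
```

With the flag held true, the coefficients stay at their previous values and the filter keeps removing coherent noise with the last estimate, which is the behaviour described for frames in which speech is detected.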
By applying an adaptive beamformer 107 to the outputs of the noise-reducing noise cancelling beamformers 105, some of the reverberation can be undone if the adaptive beamformer 107 can be focused on the desired speaker. This can be achieved by updating the parameters of the adaptive beamformer 107 when the adaptation circuit 109 determines that a (far distant) speaker is present.
The adaptation circuit 109 may provide/control two modes of operation for updating these cascaded adaptive modules.
The adaptive beamformer 107 is the last step in the algorithm.
The adaptive beamformer 107 is able to focus on the desired speech, providing some dereverberation, and it further produces a relatively constant noise.
The adaptive beamformer 107 does not suffer from changes in the other adaptive components.
The output selector 407 may in some cases select the output signal to be generated from a beamformed audio signal generated by one of the noise cancelling beamformers 105 rather than the combined signal of the adaptive beamformer 107.
The other controlled beams are not focused on that speaker, and in the absence of a strong interferer this may provide highly efficient dereverberation and noise reduction.
The operation may typically be based on block processing. For audio signals at 16 kHz, typically frames of 256 samples can be used. For each frame the outputs of the spatial decorrelator, the free running beamformer and the noise cancelling beamformer 105 are calculated.
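The framing arithmetic above can be made concrete with a short sketch. At 16 kHz, a frame of 256 samples spans 256 / 16000 s = 16 ms. The helper names below are hypothetical, and the sketch assumes non-overlapping frames with any trailing partial frame dropped; neither choice is specified in the description.

```python
import numpy as np

SAMPLE_RATE_HZ = 16_000  # sampling rate given in the description
FRAME_LEN = 256          # frame length given in the description


def frame_duration_ms(frame_len=FRAME_LEN, sample_rate_hz=SAMPLE_RATE_HZ):
    """Duration of one processing frame in milliseconds."""
    return 1000.0 * frame_len / sample_rate_hz


def split_into_frames(signal, frame_len=FRAME_LEN):
    """Split a 1-D signal into non-overlapping frames (assumption).

    A trailing partial frame is dropped (assumption).
    """
    signal = np.asarray(signal)
    n_frames = len(signal) // frame_len
    return signal[: n_frames * frame_len].reshape(n_frames, frame_len)
```

One second of audio thus yields 62 complete frames, each of which would be passed through the decorrelator and beamformer stages in turn.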
The adaptation circuit 109 may further determine which functions to adapt/update on a per frame basis, including determining which beam is active and whether the adaptive beamformer 107 should be updated. It may further determine whether a new controlled beam should be created. Further, it may determine whether a new noise beam should be created and whether the spatial decorrelator and noise beamformer(s) should be updated.
The noise-cancelling adaptive filters work in combination with the adaptive spatial decorrelator 401, specifically when the filter length is larger than the length of the adaptive spatial decorrelator 401 or when the domain of operation differs from the domain in which the adaptive spatial decorrelator 401 operates, e.g., when the noise references are delayed compared to the primary output of the beamformer such that the adaptive filter operates with delays larger than the delays in the adaptive spatial decorrelator 401.
The adaptive beamformer 107 may receive the beamformed audio signals from all of the noise cancelling beamformers 105. However, in other cases, only some of the noise cancelling beamformers 105 may generate signals used by the adaptive beamformer 107.
The audio apparatus may determine a confidence measure for each beamformed audio signal, which is indicative of a likelihood of the beamformed audio signal capturing a (desired) speech/audio source.
The audio apparatus may for each noise cancelling beamformer 105 determine whether it is currently tracking a speaker, is not tracking anything or is tracking a noise source, or whether it is currently a candidate beam for which it is not known whether the captured signal is for a valid and desired audio source or not.
The adaptive beamformer 107 may be arranged to exclude beamformed audio signals from noise cancelling beamformers for which the confidence measure does not meet a criterion.
The adaptive beamformer 107 may in many embodiments only include signals from noise cancelling beamformers 105 for which it is considered reliable that they are tracking a valid speech signal.
Such a candidate beamformer can be ignored as input to the adaptive beamformer 107 in order to reduce noise in the adaptive beamformer 107 and to keep a stable output signal during the attack of new speech or noise, i.e., during the assignment of a new (candidate) beam.
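A possible way to express this input selection is sketched below. The `BeamState` values mirror the four conditions listed above, while the class and function names, the numeric confidence range and the threshold of 0.5 are hypothetical choices made only for illustration.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import List, Sequence


class BeamState(Enum):
    TRACKING_SPEAKER = auto()
    IDLE = auto()            # not tracking anything
    TRACKING_NOISE = auto()
    CANDIDATE = auto()       # not yet known whether the source is valid


@dataclass
class Beam:
    index: int
    state: BeamState
    confidence: float  # hypothetical confidence measure in [0, 1]


def beams_for_adaptive_beamformer(beams: Sequence[Beam],
                                  min_confidence: float = 0.5) -> List[int]:
    """Indices of beams whose signals are passed to the adaptive beamformer.

    Candidate, idle and noise-tracking beams are excluded, as are beams whose
    confidence measure does not meet the (assumed) threshold criterion.
    """
    return [b.index for b in beams
            if b.state is BeamState.TRACKING_SPEAKER
            and b.confidence >= min_confidence]
```

Excluding candidate beams in this way keeps the set of inputs unchanged while a new beam is being assigned, which is consistent with the aim of a stable output during the attack of new speech or noise.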
FIG. 5 is a block diagram illustrating an example processor 500 according to embodiments of the disclosure.
Processor 500 may be used to implement one or more processors implementing an apparatus as previously described or elements thereof (including in particular the beamformers as described).
Processor 500 may be any suitable processor type including, but not limited to, a microprocessor, a microcontroller, a Digital Signal Processor (DSP), a Field Programmable Gate Array (FPGA) where the FPGA has been programmed to form a processor, a Graphics Processing Unit (GPU), an Application Specific Integrated Circuit (ASIC) where the ASIC has been designed to form a processor, or a combination thereof.
The processor 500 may include one or more cores 502.
The core 502 may include one or more Arithmetic Logic Units (ALU) 504.
The core 502 may include a Floating Point Logic Unit (FPLU) 506 and/or a Digital Signal Processing Unit (DSPU) 508 in addition to or instead of the ALU 504.
The processor 500 may include one or more registers 512 communicatively coupled to the core 502.
The registers 512 may be implemented using dedicated logic gate circuits (e.g., flip-flops) and/or any memory technology. In some embodiments the registers 512 may be implemented using static memory.
The registers 512 may provide data, instructions and addresses to the core 502.
The processor 500 may include one or more levels of cache memory 510 communicatively coupled to the core 502.
The cache memory 510 may provide computer-readable instructions to the core 502 for execution.
The cache memory 510 may provide data for processing by the core 502.
The computer-readable instructions may have been provided to the cache memory 510 by a local memory, for example, local memory attached to the external bus 516.
The cache memory 510 may be implemented with any suitable cache memory type, for example, Metal-Oxide Semiconductor (MOS) memory such as Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), and/or any other suitable memory technology.
The processor 500 may include a controller 514, which may control input to the processor 500 from other processors and/or components included in a system and/or outputs from the processor 500 to other processors and/or components included in the system. Controller 514 may control the data paths in the ALU 504, FPLU 506 and/or DSPU 508. Controller 514 may be implemented as one or more state machines, data paths and/or dedicated control logic. The gates of controller 514 may be implemented as standalone gates, FPGA, ASIC or any other suitable technology.
The registers 512 and the cache 510 may communicate with controller 514 and core 502 via internal connections 520A, 520B, 520C and 520D.
Internal connections may be implemented as a bus, multiplexer, crossbar switch, and/or any other suitable connection technology.
Inputs and outputs for the processor 500 may be provided via a bus 516, which may include one or more conductive lines.
The bus 516 may be communicatively coupled to one or more components of processor 500, for example the controller 514, cache 510, and/or register 512.
The bus 516 may be coupled to one or more components of the system.
The bus 516 may be coupled to one or more external memories.
The external memories may include Read Only Memory (ROM) 532.
ROM 532 may be a masked ROM, Erasable Programmable Read Only Memory (EPROM) or any other suitable technology.
The external memory may include Random Access Memory (RAM) 533.
RAM 533 may be a static RAM, battery backed up static RAM, Dynamic RAM (DRAM) or any other suitable technology.
The external memory may include Electrically Erasable Programmable Read Only Memory (EEPROM) 535.
The external memory may include Flash memory 534.
The external memory may include a magnetic storage device such as disc 536. In some embodiments, the external memories may be included in a system.
The invention can be implemented in any suitable form including hardware, software, firmware or any combination of these.
The invention may optionally be implemented at least partly as computer software running on one or more data processors and/or digital signal processors.
The elements and components of an embodiment of the invention may be physically, functionally and logically implemented in any suitable way. Indeed the functionality may be implemented in a single unit, in a plurality of units or as part of other functional units. As such, the invention may be implemented in a single unit or may be physically and functionally distributed between different units, circuits and processors.

EP24182816.9A 2024-06-18 2024-06-18 Audio apparatus and method of operation therefor Pending EP4668265A1 (fr)

Priority Applications (2)

Application Number	Priority Date	Filing Date	Title
EP24182816.9A EP4668265A1 (fr)	2024-06-18	2024-06-18	Audio apparatus and method of operation therefor
PCT/EP2025/065962 WO2025261811A1 (fr)	2024-06-18	2025-06-09	Audio apparatus and method of operation therefor

Applications Claiming Priority (1)

Application Number	Priority Date	Filing Date	Title
EP24182816.9A EP4668265A1 (fr)	2024-06-18	2024-06-18	Audio apparatus and method of operation therefor

Publications (1)

Publication Number	Publication Date
EP4668265A1 (fr)	2025-12-24

Family

ID=91585428

Family Applications (1)

Application Number	Priority Date	Filing Date	Title
EP24182816.9A Pending EP4668265A1 (fr)	2024-06-18	2024-06-18	Audio apparatus and method of operation therefor

Country Status (2)

Country	Link
EP (1)	EP4668265A1 (fr)
WO (1)	WO2025261811A1 (fr)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number	Priority date	Publication date	Assignee	Title
US7146012B1 (en)	1997-11-22	2006-12-05	Koninklijke Philips Electronics N.V.	Audio processing arrangement with multiple sources
US7602926B2 (en)	2002-07-01	2009-10-13	Koninklijke Philips Electronics N.V.	Stationary spectral power dependent audio enhancement system
WO2009132646A1 (fr) *	2008-05-02	2009-11-05	Gn Netcom A/S	Method of combining at least two audio signals and a microphone system comprising at least two microphones
US20150172807A1 (en) *	2013-12-13	2015-06-18	Gn Netcom A/S	Apparatus And A Method For Audio Signal Processing

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JACEK P DMOCHOWSKI ET AL: "Decoupled Beamforming and Noise Cancellation", IEEE TRANSACTIONS ON INSTRUMENTATION AND MEASUREMENT, IEEE, USA, vol. 55, no. 1, 1 February 2007 (2007-02-01), pages 80 - 88, XP011155766, ISSN: 0018-9456 *
KHANNA R ET AL: "ADAPTIVE BEAM FORMING USING A CASCADE CONFIGURATION", IEEE TRANSACTIONS ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, IEEE INC. NEW YORK, USA, vol. 31, no. 4, 1 August 1983 (1983-08-01), pages 940 - 945, XP001154480, ISSN: 0096-3518, DOI: 10.1109/TASSP.1983.1164157 *

Also Published As

Publication number	Publication date
WO2025261811A1 (fr)	2025-12-26

Legal Events

Date	Code	Title	Description
2025-11-21	PUAI	Public reference made under article 153(3) epc to a published international application that has entered the european phase	Free format text: ORIGINAL CODE: 0009012