WO2022097919A1 - Beamforming method and beamforming system using a neural network - Google Patents
Beamforming method and beamforming system using a neural network
- Publication number
- WO2022097919A1 (application PCT/KR2021/013328)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- beamforming
- sound signal
- microphone
- phase difference
- fourier transform
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/14—Fourier, Walsh or analogous domain transformations, e.g. Laplace, Hilbert, Karhunen-Loeve, transforms
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04B—TRANSMISSION
- H04B7/00—Radio transmission systems, i.e. using radiation field
- H04B7/02—Diversity systems; Multi-antenna system, i.e. transmission or reception using multiple antennas
- H04B7/04—Diversity systems; Multi-antenna system, i.e. transmission or reception using multiple antennas using two or more spaced independent antennas
- H04B7/06—Diversity systems; Multi-antenna system, i.e. transmission or reception using multiple antennas using two or more spaced independent antennas at the transmitting station
- H04B7/0613—Diversity systems; Multi-antenna system, i.e. transmission or reception using multiple antennas using two or more spaced independent antennas at the transmitting station using simultaneous transmission
- H04B7/0615—Diversity systems; Multi-antenna system, i.e. transmission or reception using multiple antennas using two or more spaced independent antennas at the transmitting station using simultaneous transmission of weighted versions of same signal
- H04B7/0617—Diversity systems; Multi-antenna system, i.e. transmission or reception using multiple antennas using two or more spaced independent antennas at the transmitting station using simultaneous transmission of weighted versions of same signal for beam forming
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/09—Supervised learning
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04B—TRANSMISSION
- H04B7/00—Radio transmission systems, i.e. using radiation field
- H04B7/02—Diversity systems; Multi-antenna system, i.e. transmission or reception using multiple antennas
- H04B7/04—Diversity systems; Multi-antenna system, i.e. transmission or reception using multiple antennas using two or more spaced independent antennas
- H04B7/06—Diversity systems; Multi-antenna system, i.e. transmission or reception using multiple antennas using two or more spaced independent antennas at the transmitting station
- H04B7/0613—Diversity systems; Multi-antenna system, i.e. transmission or reception using multiple antennas using two or more spaced independent antennas at the transmitting station using simultaneous transmission
- H04B7/0682—Diversity systems; Multi-antenna system, i.e. transmission or reception using multiple antennas using two or more spaced independent antennas at the transmitting station using simultaneous transmission using phase diversity (e.g. phase sweeping)
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; ELECTRIC HEARING AIDS; PUBLIC ADDRESS SYSTEMS
- H04R1/00—Details of transducers, loudspeakers or microphones
- H04R1/20—Arrangements for obtaining desired frequency or directional characteristics
- H04R1/32—Arrangements for obtaining desired frequency or directional characteristics for obtaining desired directional characteristic only
- H04R1/40—Arrangements for obtaining desired frequency or directional characteristics for obtaining desired directional characteristic only by combining a number of identical transducers
- H04R1/406—Arrangements for obtaining desired frequency or directional characteristics for obtaining desired directional characteristic only by combining a number of identical transducers microphones
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; ELECTRIC HEARING AIDS; PUBLIC ADDRESS SYSTEMS
- H04R3/00—Circuits for transducers
- H04R3/005—Circuits for transducers for combining the signals of two or more microphones
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W24/00—Supervisory, monitoring or testing arrangements
- H04W24/02—Arrangements for optimising operational condition
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N7/00—Television systems
- H04N7/14—Systems for two-way working
- H04N7/15—Conference systems
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; ELECTRIC HEARING AIDS; PUBLIC ADDRESS SYSTEMS
- H04R2201/00—Details of transducers, loudspeakers or microphones covered by H04R1/00 but not provided for in any of its subgroups
- H04R2201/40—Details of arrangements for obtaining desired directional characteristic by combining a number of identical transducers covered by H04R1/40 but not provided for in any of its subgroups
- H04R2201/401—2D or 3D arrays of transducers
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; ELECTRIC HEARING AIDS; PUBLIC ADDRESS SYSTEMS
- H04R2201/00—Details of transducers, loudspeakers or microphones covered by H04R1/00 but not provided for in any of its subgroups
- H04R2201/40—Details of arrangements for obtaining desired directional characteristic by combining a number of identical transducers covered by H04R1/40 but not provided for in any of its subgroups
- H04R2201/403—Linear arrays of transducers
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; ELECTRIC HEARING AIDS; PUBLIC ADDRESS SYSTEMS
- H04R2430/00—Signal processing covered by H04R, not provided for in its groups
- H04R2430/03—Synergistic effects of band splitting and sub-band processing
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; ELECTRIC HEARING AIDS; PUBLIC ADDRESS SYSTEMS
- H04R2430/00—Signal processing covered by H04R, not provided for in its groups
- H04R2430/20—Processing of the output signals of the acoustic transducers of an array for obtaining a desired directivity characteristic
- H04R2430/23—Direction finding using a sum-delay beam-former
Definitions
- The present invention relates to a beamforming method and a beamforming system using a neural network.
- The cocktail party effect refers to the phenomenon in which party attendees selectively focus on and take in a conversation with their interlocutor despite being in a room with loud ambient noise.
- Implementing this ability in a machine, namely beamforming, is known as the cocktail party problem, and attempts have been made to use neural networks to solve it. Improving the performance of beamforming is closely tied to the performance of audio-related electronic products and, because it also bears on hearing aids, it is an important social issue as well.
- Beamforming may refer to a process of reconstructing a target signal by analyzing sound acquired using two or more microphones. For this purpose, techniques that minimize the volume of the incoming sound while satisfying a given constraint, such as creating and combining artificial delay times for the signal from each microphone, have long been used. Recently, improving the performance of the minimum variance distortionless response (MVDR) beamformer with a neural network, and training methods for neural network models that implement the beamformer, have been actively studied.
- The problem to be solved by the present invention is to provide a beamforming method and a beamforming system that use a neural network to overcome the limitation that rule-based beamforming requires a large amount of computation to obtain spatial information, with the neural network structure designed to be optimized for beamforming so that the amount of computation is minimized.
- A beamforming method includes: receiving a first sound signal and a second sound signal using a first microphone and a second microphone spaced apart from the first microphone by a predetermined distance, respectively; obtaining a Fourier transform result for each of the first sound signal and the second sound signal; obtaining a phase difference between the first sound signal and the second sound signal from the Fourier transform results; performing an operation on the phase difference by inputting it into a beamforming model using a neural processor; performing an element-wise product on the operation result of the neural processor and the Fourier transform result of the first sound signal; and outputting the element-wise product result.
- Performing the element-wise product may further include applying a mask to the operation result before performing the element-wise product.
- Performing the element-wise product may further include performing gain control after performing the element-wise product.
- The predetermined distance may be 10 cm to 14 cm.
- The beamforming method may further include training the beamforming model by using the phase difference.
- A beamforming system includes: a first microphone for receiving a first sound signal; a second microphone spaced apart from the first microphone by a predetermined distance to receive a second sound signal; a first STFT module for obtaining a Fourier transform result for the first sound signal; a second STFT module for obtaining a Fourier transform result for the second sound signal; a phase difference obtaining module for obtaining a phase difference between the first sound signal and the second sound signal from the Fourier transform results; a neural processor that receives the phase difference and performs a neural network operation using a beamforming model; an element-wise product module for performing an element-wise product on the operation result of the neural processor and the Fourier transform result of the first sound signal; and an output module for outputting the result of the element-wise product.
- The beamforming system may further include a masking module for applying a mask to the operation result before the element-wise product is performed.
- The beamforming system may further include a gain control module that performs gain control after the element-wise product is performed.
- The predetermined distance may be 10 cm to 14 cm.
- The beamforming system may further include a learning module for training the beamforming model by using the phase difference.
- According to the present invention, the voice received by the microphones can be restored using only the phase difference, without calculating the steering vector and the spatial correlation matrix for various noise environments, so that beamforming can be implemented efficiently.
- FIG. 1 is a view for explaining a beamforming system according to an embodiment of the present invention.
- FIG. 2 is a diagram for explaining a computing device for implementing a beamforming apparatus according to embodiments of the present invention.
- FIG. 3 is a diagram for explaining a beamforming method according to an embodiment of the present invention.
- FIG. 4 is a view for explaining a beamforming method according to an embodiment of the present invention.
- FIG. 5 is a diagram for explaining an example of an implementation of a beamforming method according to an embodiment of the present invention.
- FIG. 6 is a diagram for explaining a beamforming system according to an embodiment of the present invention.
- FIG. 7 is a view for explaining a beamforming system according to an embodiment of the present invention.
- FIG. 8 is a diagram for explaining a beamforming system according to an embodiment of the present invention.
- FIGS. 9 and 10 are diagrams for explaining advantageous effects of a beamforming method and a beamforming system according to embodiments of the present invention.
- FIG. 1 is a view for explaining a beamforming system according to an embodiment of the present invention.
- A beamforming system 1 may include a monitor 20 and a beamforming device 10 that includes a first microphone M1, a second microphone M2, and a connection terminal T.
- The beamforming apparatus 10 may be attached to the monitor 20 to receive sound using the microphones M1 and M2.
- For example, the beamforming apparatus 10 may receive the voice of a person participating in a video conference in front of the monitor 20 using the microphones M1 and M2.
- In particular, the beamforming apparatus 10 may receive the voice of a person participating in a video conference in an environment with a lot of ambient noise.
- The beamforming apparatus 10 may perform beamforming on the sound signal received using the microphones M1 and M2 and then output the resulting beamformed sound signal.
- The beamforming apparatus 10 may pick out the voice of a person participating in a video conference in an environment with a lot of ambient noise and provide it to another computing device (e.g., a personal computer to which the monitor 20 is connected). That computing device can then provide the extracted human voice to, for example, other video conference participants.
- A connection terminal T may be used for the beamforming device 10 to provide the output signal containing the extracted human voice to another computing device. In this embodiment, the connection terminal T may be a Universal Serial Bus (USB) terminal, but the scope of the present invention is not limited thereto.
- The first microphone M1 and the second microphone M2 may be disposed to be spaced apart by a predetermined distance D.
- The first microphone M1 may receive, on a first side (e.g., the left side), the voice of a person participating in the video conference and ambient noise (i.e., the first sound signal), and the second microphone M2 may receive, on a second side (e.g., the right side) spaced apart from the first microphone M1 by the predetermined distance D, the voice of the person participating in the video conference and ambient noise (i.e., the second sound signal).
- The predetermined distance D between the first microphone M1 and the second microphone M2 may be 10 cm to 14 cm, preferably 12 cm, but the scope of the present invention is not limited thereto.
- FIG. 2 is a diagram for explaining a computing device for implementing a beamforming apparatus according to embodiments of the present invention.
- A computing device for implementing the beamforming apparatus 10 may include a processor 100, a neural processor 110, a memory 120, an output module 130, a first microphone M1, and a second microphone M2, and may operate to perform the beamforming method according to embodiments of the present invention.
- The processor 100, the neural processor 110, the memory 120, the output module 130, the first microphone M1, and the second microphone M2 may exchange data with each other through the bus 190.
- The processor 100 performs overall control of the beamforming apparatus 10, and may perform the functions and methods described herein together with the neural processor 110 or independently of it.
- The processor 100 may be implemented by various types of processors, such as an application processor (AP), a central processing unit (CPU), or a graphics processing unit (GPU), and the scope of the present invention is not limited to a specific processor.
- Among the functions and methods described herein, the neural processor 110 may in particular perform the neural network operations.
- For example, the neural processor 110 may perform an operation using the beamforming model described herein.
- The neural network may include a convolutional neural network (CNN), but the scope of the present invention is not limited thereto.
- The memory 120 may be implemented as volatile memory such as static random access memory (SRAM), non-volatile memory such as flash memory, or a combination of volatile memory and non-volatile memory.
- The output module 130 may be an input/output interface device that outputs the beamformed sound signal obtained by performing beamforming on the sound signal that the beamforming device 10 receives using the microphones M1 and M2.
- At least some of the beamforming method, the beamforming apparatus, and the beamforming system according to the embodiments of the present invention may be implemented as a program or software executed on a computing device, and the program or software may be stored in a computer-readable medium.
- At least some of the beamforming method, the beamforming apparatus, and the beamforming system according to embodiments of the present invention may also be implemented as hardware that can be electrically connected to the computing device.
- The beamforming device 10 described with reference to FIG. 1 is implemented to be attached to the monitor 20 and may be connected to another computing device to provide an output signal containing the extracted human voice. Notably, because the beamforming device 10 has its own neural processor 110, it can perform the neural network computations that pick out the voices of people participating in a video conference in an environment with a lot of ambient noise without using the computational resources of other computing devices.
- A beamforming method according to embodiments of the present invention will now be described with reference to FIGS. 3 to 5.
- FIG. 3 is a diagram for explaining a beamforming method according to an embodiment of the present invention.
- The first microphone M1 may receive the first sound signal S1 from the first side.
- The first microphone M1 may receive, from the first side, the first sound signal S1 including the voice of a person participating in the video conference and ambient noise, and transmit it to the first STFT module 300 (shown as the L_STFT module in the figure).
- The first STFT module 300 may perform a Fourier transform operation on the first sound signal S1 received from the first microphone M1 and obtain a Fourier transform result P1 for the first sound signal S1.
- The second microphone M2, disposed to be spaced apart from the first microphone M1 by the predetermined distance D, may receive the second sound signal S2 from the second side.
- The second microphone M2 may receive, from the second side, the second sound signal S2 including the voice of a person participating in the video conference and ambient noise, and transmit it to the second STFT module 301 (shown as the R_STFT module in the figure).
- The second STFT module 301 may perform a Fourier transform operation on the second sound signal S2 received from the second microphone M2 and obtain a Fourier transform result P2 for the second sound signal S2.
- The phase difference obtaining module 302 may obtain the phase difference dP between the first sound signal S1 and the second sound signal S2 from the Fourier transform result P1 provided by the first STFT module 300 and the Fourier transform result P2 provided by the second STFT module 301.
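One way the phase difference dP could be computed from the two Fourier transform results is sketched below. The function name `phase_difference` and the sample values are illustrative assumptions; the patent does not specify how the module is implemented. Multiplying one spectrum by the complex conjugate of the other yields the difference of the two phases, wrapped to the principal range.

```python
import cmath
import math

def phase_difference(p1, p2):
    """Per-bin phase difference between two complex spectra, wrapped to [-pi, pi]."""
    if len(p1) != len(p2):
        raise ValueError("spectra must have the same number of bins")
    # phase(a * conj(b)) equals phase(a) - phase(b), wrapped to the principal value
    return [cmath.phase(a * b.conjugate()) for a, b in zip(p1, p2)]

# One frame with three frequency bins (illustrative values only)
P1 = [cmath.rect(1.0, 0.3), cmath.rect(2.0, -1.0), cmath.rect(0.5, 3.0)]
P2 = [cmath.rect(1.5, 0.1), cmath.rect(1.0, -1.0), cmath.rect(0.7, -3.0)]
dP = phase_difference(P1, P2)
```

The magnitudes of P1 and P2 do not affect dP, which matches the idea that the model sees only phase information.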
- The learning module 303 may train the beamforming model 304 by using the phase difference dP between the first sound signal S1 and the second sound signal S2. Accordingly, the beamforming model 304 may be trained to perform beamforming using only the phase difference between the two sound signals received through the first microphone M1 and the second microphone M2.
- The predetermined distance D between the first microphone M1 and the second microphone M2 may be 10 cm to 14 cm, preferably 12 cm, but the scope of the present invention is not limited thereto.
- For example, when the beamforming model 304 is trained with the predetermined distance D set to 12 cm, good performance can be achieved at inference when the distance between the first microphone M1 and the second microphone M2 is anywhere from 10 cm to 14 cm.
- FIG. 4 is a view for explaining a beamforming method according to an embodiment of the present invention.
- The first microphone M1 may receive the first sound signal S1 from the first side.
- The first microphone M1 may receive, from the first side, the first sound signal S1 including the voice of a person participating in the video conference and ambient noise, and transmit it to the first STFT module 310 (shown as the L_STFT module in the figure).
- The first STFT module 310 may perform a Fourier transform operation on the first sound signal S1 received from the first microphone M1 and obtain a Fourier transform result P1 for the first sound signal S1.
- The second microphone M2, disposed to be spaced apart from the first microphone M1 by the predetermined distance D, may receive the second sound signal S2 from the second side.
- The second microphone M2 may receive, from the second side, the second sound signal S2 including the voice of a person participating in the video conference and ambient noise, and transmit it to the second STFT module 311 (shown as the R_STFT module in the figure).
- The second STFT module 311 may perform a Fourier transform operation on the second sound signal S2 received from the second microphone M2 and obtain a Fourier transform result P2 for the second sound signal S2.
- The phase difference obtaining module 312 may obtain the phase difference dP between the first sound signal S1 and the second sound signal S2 from the Fourier transform result P1 provided by the first STFT module 310 and the Fourier transform result P2 provided by the second STFT module 311.
- The trained beamforming model 314 may receive the phase difference dP between the first sound signal S1 and the second sound signal S2 as an input and perform a neural network operation (i.e., an inference operation).
- The masking module 315 may apply a mask to the inference result, and the element-wise product module 316 may then perform an element-wise product on the inference result (or the masked result) and the Fourier transform result P2 of the second sound signal S2 received from the second STFT module 311.
- Here, the element-wise (elemental) product may be an operation that multiplies the corresponding components of two matrices of the same size.
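As a small illustration of this operation, the sketch below multiplies a real-valued mask with a complex spectrogram entry by entry. The helper name `elementwise_product` and the 2x2 values are made up for the example.

```python
def elementwise_product(a, b):
    """Multiply corresponding entries of two equally sized 2-D lists."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("matrices must have the same shape")
    return [[x * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

mask = [[1.0, 0.5], [0.0, 0.25]]              # real-valued mask (frequency x time)
spec = [[2 + 2j, 4 + 0j], [1 - 1j, 0 + 8j]]   # complex spectrogram of the same shape
out = elementwise_product(mask, spec)
```

Unlike a matrix product, each output entry depends only on the two entries at the same position, so a mask value of 0 removes a time-frequency bin and a value of 1 passes it unchanged.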
- The output module 317 may output the element-wise product result S3 provided by the element-wise product module 316.
- That is, the output module 317 may output the beamformed sound signal S3 obtained by performing beamforming, using the beamforming model 314, on the sound signal received using the microphones M1 and M2.
- The beamformed sound signal S3 may be the voice of a person participating in a video conference in an environment with a lot of ambient noise; it may be provided to another computing device (e.g., a personal computer to which the monitor 20 is connected) and then to other video conference participants.
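The inference flow of FIG. 4 can be sketched for a single frame as follows. This is a minimal illustration under strong assumptions: `toy_model` is a hypothetical stand-in for the trained beamforming model 314 (it is not a neural network), a naive DFT of one frame replaces the STFT modules, and the masking and gain control steps are omitted.

```python
import cmath
import math

def dft(frame):
    """Naive DFT of one real-valued frame (non-negative frequencies only)."""
    n = len(frame)
    return [sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
            for k in range(n // 2 + 1)]

def toy_model(dp):
    """HYPOTHETICAL stand-in for the trained model 314: maps each phase
    difference to a value in [0, 1] that is largest when dP is 0."""
    return [0.5 * (1.0 + math.cos(d)) for d in dp]

def beamform_frame(s1, s2, model=toy_model):
    p1, p2 = dft(s1), dft(s2)                                      # STFT modules
    dp = [cmath.phase(a * b.conjugate()) for a, b in zip(p1, p2)]  # phase difference dP
    mask = model(dp)                                               # model operation
    return [m * b for m, b in zip(mask, p2)]                       # element-wise product with P2

# A frontal source reaches both microphones identically, so dP = 0 everywhere
frame = [math.sin(2 * math.pi * 2 * t / 16) for t in range(16)]
s3 = beamform_frame(frame, frame)
```

With identical inputs the stand-in model outputs a mask of 1 in every bin, so the frame passes through unchanged; an inverted second channel (phase difference of pi) would be suppressed completely by this toy mapping.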
- FIG. 5 is a diagram for explaining an example of an implementation of a beamforming method according to an embodiment of the present invention.
- Two or more microphones are basically required for directional hearing.
- The geometry of the microphone array provides spatial characteristics by which the signals received by each microphone can be aligned.
- The process of obtaining a mask for beamforming can be formulated as follows. First, suppose that each signal received from the plurality of microphones is subjected to a short-time Fourier transform (STFT) to obtain a spectrogram, and that the desired voice is expressed in that domain [equation not preserved in the source text].
- The input of the microphone array can likewise be expressed in that domain [equation not preserved in the source text].
- In the filter expression, the superscript denotes the conjugate transpose, i.e., the matrix transposed after taking the complex conjugate. The spectrogram of the desired speech is then obtained by applying the resulting filter to the array input [equation not preserved in the source text]. When implementing beamforming with this method, the most important part is to find the steering vector and the spatial correlation matrix accurately.
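For reference, the conventional MVDR formulation that this passage describes in words can be written as below. The symbols are notation assumed here for illustration (the patent's own symbols are not shown in the text): $\mathbf{Y}$ is the microphone array input, $S$ the desired voice, $\mathbf{N}$ the noise, $\mathbf{d}$ the steering vector, $\mathbf{\Phi}_{NN}$ the spatial correlation matrix of the noise, and $(\cdot)^{H}$ the conjugate transpose.

```latex
\begin{align}
  \mathbf{Y}(t,f) &= \mathbf{d}(f)\,S(t,f) + \mathbf{N}(t,f) \\
  \mathbf{w}(f)   &= \frac{\mathbf{\Phi}_{NN}^{-1}(f)\,\mathbf{d}(f)}
                          {\mathbf{d}^{H}(f)\,\mathbf{\Phi}_{NN}^{-1}(f)\,\mathbf{d}(f)} \\
  \hat{S}(t,f)    &= \mathbf{w}^{H}(f)\,\mathbf{Y}(t,f)
\end{align}
```

The filter $\mathbf{w}$ minimizes the output noise power subject to $\mathbf{w}^{H}\mathbf{d} = 1$, which is why both $\mathbf{d}$ and $\mathbf{\Phi}_{NN}$ must be estimated accurately in this approach.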
- A steering vector that mathematically models the path from the sound source to each microphone is required.
- Mathematical modeling is very difficult when the mouth of the speaker is located at a close distance, for example, around 1 m, and the distance between the microphones is also small, for example, several centimeters or tens of centimeters.
- If the steering vector is set to a fixed value, the user's discomfort increases further.
- For this reason, a neural network may be used instead of a steering vector.
- The phase difference matrix used as the input of the neural network in this method is simpler than the spatial correlation matrix used to obtain the location information of the sound source in the existing method, so the location information of the sound source can be obtained more easily.
- In the beamforming method according to an embodiment of the present invention, a neural network is trained on spatial information using only two microphones, and through this, sound can be acquired from a predetermined direction. It therefore has the following advantages.
- The sound path is affected only by the angle of the sound source. Sound coming from the front reaches both microphones identically, because the distance from each microphone to the sound source is the same. That is, the time difference of arrival (TDOA) of a frontal source approaches zero. Using this property, the sound heard from the front can be left as it is.
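The relationship between source angle, TDOA, and phase difference can be illustrated with a simple plane-wave model. This is a simplification assumed for the example: it treats the source as far away, whereas the patent deals with sources at about 1 m, and the speed of sound value and the function names are also assumptions.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s in air at about 20 degrees C (assumed)

def tdoa(angle_deg, mic_distance=0.12):
    """Far-field time difference of arrival in seconds.
    angle_deg is measured from the frontal direction (0 = straight ahead)."""
    return mic_distance * math.sin(math.radians(angle_deg)) / SPEED_OF_SOUND

def phase_diff(angle_deg, freq_hz, mic_distance=0.12):
    """Unwrapped inter-microphone phase difference in radians at freq_hz."""
    return 2.0 * math.pi * freq_hz * tdoa(angle_deg, mic_distance)

front = tdoa(0.0)    # frontal source: no delay between the microphones
side = tdoa(90.0)    # source on the microphone axis: maximum delay
```

A frontal source gives a TDOA of zero and therefore a zero phase difference at every frequency, while off-axis sources produce a phase difference that grows with both angle and frequency.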
- Embodiments of the present invention provide a method of recognizing the phase difference through a neural network and the spectrogram pattern of a reference microphone.
- Since the mask to be designed is defined by real numbers between 0 and 1, it can perform a function similar to an ideal binary mask (IBM). However, because it can also take values in between, it may be called a soft binary mask (SBM).
- The noisy phase is used as it is. Consequently, when speech is reconstructed from the noisy spectrogram, the signal can be recovered simply by retaining the frequency regions that make up the speech. The SBM-type mask produced by the neural network therefore acts as a directional auditory mask that is derived from the phase difference and applied to the magnitude, and it serves to retain the signal even for elements that have a phase difference.
- CNNs are more efficient for 2D matrices such as images and require less computational power.
- The convolution filter is optimized for the reduction width according to the phase difference of the ideal mask.
- Backpropagation can be used to prevent the problem that the target speech pattern cannot be learned when only the phase difference is input to the neural network.
- The mean squared error (MSE) obtained in the time domain is used as the loss function for training a mask with phase information.
- The aforementioned method is similar to this method.
- After an inverse STFT (ISTFT), a loss function can be used to compare the reconstructed signal with a clean target sound in the time domain.
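A time-domain MSE of this kind can be sketched as follows. The function name `mse_loss` and the sample waveforms are illustrative assumptions; in an actual training setup the estimate would be the ISTFT of the masked spectrogram and the loss would be computed inside a framework that supports backpropagation.

```python
def mse_loss(estimate, target):
    """Mean squared error between two equally long waveforms."""
    if len(estimate) != len(target):
        raise ValueError("waveforms must have the same length")
    return sum((e - t) ** 2 for e, t in zip(estimate, target)) / len(target)

clean = [0.0, 0.5, 1.0, 0.5]   # clean target sound (time domain)
recon = [0.0, 0.5, 0.5, 0.5]   # reconstructed signal (time domain)
loss = mse_loss(recon, clean)  # (0.5 ** 2) / 4
```

Because the comparison happens on waveforms rather than on the mask itself, the gradient carries information about the target speech even though the network input is only the phase difference.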
- The gradient value to be updated may include voice pattern information for the reference microphone.
- A data set was generated by simulation, the subject's voice was reconstructed using the neural network, and stereo-channel sound sources were generated by simulating a 10 m x 10 m x 10 m space.
- The height of the microphones is 2 m, and their two positions, (9, 5.06, 2) and (9, 4.94, 2), are offset by 6 cm to the left and right of the center.
- The sound sources are located on a semicircle with a diameter of 1 meter, and the center of the semicircle coincides with the center of the microphones.
- The target sound source is located in front, at (7, 5, 2). The semicircle is divided into four sections, and a noise source is placed at a random position within each section.
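The stated coordinates can be checked with a few lines of arithmetic. The variable names below are illustrative; the coordinates are the ones given in the text, in meters.

```python
import math

MIC_A = (9.0, 5.06, 2.0)   # first microphone position from the text
MIC_B = (9.0, 4.94, 2.0)   # second microphone position from the text
TARGET = (7.0, 5.0, 2.0)   # frontal target source position from the text

mic_spacing = math.dist(MIC_A, MIC_B)   # distance between the two microphones
d_a = math.dist(TARGET, MIC_A)          # target to first microphone
d_b = math.dist(TARGET, MIC_B)          # target to second microphone
path_difference = abs(d_a - d_b)
```

The spacing evaluates to about 0.12 m, consistent with the preferred 12 cm distance, and the target is equidistant from both microphones, so its path difference (and hence its TDOA) is essentially zero.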
- Part of the speech data set was used for training data, and another part was used for test data.
- For the noise data set, two artificial noises (speech-shaped noise and babble noise) and the DEMAND database, which contains 13 recorded noises, were used.
- The training data consists of 2 artificial noises and 8 recorded noises (cafe, car, kitchen, meeting, metro, restaurant, station, traffic).
- The test data consisted of five recorded noises (bus, cafeteria, living room, office, and public plaza). To produce the noise, segments as long as the speech were extracted starting at 4 random points of the same noise signal. The signals received at the two microphones were then obtained by simulating the sound arriving from the four sound sources.
- For the training data, a combination of 40 conditions, that is, 10 noise situations and 4 SNRs (0 dB, 5 dB, 10 dB, 15 dB), was used.
- The test data used a combination of 20 different conditions: 5 noise situations and 4 SNRs (2.5 dB, 7.5 dB, 12.5 dB, 17.5 dB). In this case, only energy attenuation with distance was simulated, using the room impulse response of the image-source method. So that reverberation is not considered, the reverberation time is set to 0.
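Mixing speech and noise at a chosen SNR is usually done by rescaling the noise. The sketch below shows one common way to do this; the function names and the toy signals are assumptions, and the patent does not state how its mixtures were scaled.

```python
import math

def power(signal):
    """Average power of a waveform."""
    return sum(x * x for x in signal) / len(signal)

def scale_noise_to_snr(speech, noise, target_snr_db):
    """Return a copy of noise scaled so the speech-to-noise power ratio is target_snr_db."""
    p_noise = power(noise)
    if p_noise == 0.0:
        raise ValueError("noise must not be silent")
    gain = math.sqrt(power(speech) / (p_noise * 10.0 ** (target_snr_db / 10.0)))
    return [gain * x for x in noise]

def measure_snr_db(speech, noise):
    """SNR in dB between a speech waveform and a noise waveform."""
    return 10.0 * math.log10(power(speech) / power(noise))

speech = [1.0, -1.0, 1.0, -1.0]
noise = [0.5, 0.5, -0.5, -0.5]
training_snrs = [0.0, 5.0, 10.0, 15.0]   # SNR grid used for the training data
mixtures = {snr: [s + n for s, n in zip(speech, scale_noise_to_snr(speech, noise, snr))]
            for snr in training_snrs}
```

The same helper would cover the test grid (2.5, 7.5, 12.5, 17.5 dB) by changing the list of target values.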
- The STFT uses a 256-point Hamming window on a signal sampled at 16 kHz.
- The window shift is 128 points (128 points of overlap). The same conditions are used when performing the ISTFT after the neural network processing is complete.
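A minimal NumPy sketch of an STFT and ISTFT with these parameters is shown here. The framing without edge padding and the overlap-add normalization by the summed window are assumptions chosen to make the round trip exact; the source only states the window type, window length, hop size, and sample rate.

```python
import numpy as np

FS = 16_000   # sample rate in Hz
N_FFT = 256   # Hamming window length in samples
HOP = 128     # window shift in samples (50 % overlap)


def stft(x, n_fft=N_FFT, hop=HOP):
    """Return a complex spectrogram of shape (n_fft // 2 + 1, n_frames)."""
    window = np.hamming(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=1).T


def istft(spec, n_fft=N_FFT, hop=HOP):
    """Overlap-add inverse, normalized by the summed analysis window."""
    window = np.hamming(n_fft)
    frames = np.fft.irfft(spec.T, n=n_fft, axis=1)
    n_frames = frames.shape[0]
    length = n_fft + hop * (n_frames - 1)
    out = np.zeros(length)
    norm = np.zeros(length)
    for i in range(n_frames):
        out[i * hop : i * hop + n_fft] += frames[i]
        norm[i * hop : i * hop + n_fft] += window
    # The Hamming window never reaches zero, so norm is positive everywhere.
    return out / norm
```

A 256-point transform at 16 kHz yields 129 frequency bins spaced 62.5 Hz apart, which is where the "128 low frequencies plus a 129th bin" in the description comes from.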
- The model structure follows Table 1.
- The input includes only the 128 lowest-frequency bins of the STFT result.
- The mask output by the neural network is multiplied by the spectrogram of the 128 input frequency bins, the 129th frequency bin is filled with 0, and the ISTFT is performed to obtain the reconstructed signal.
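The masking step can be sketched as follows, assuming a complex spectrogram with 129 bins stored as a NumPy array of shape (129, T) and a mask of shape (128, T). The function name `apply_mask` and the array layout are illustrative choices, not taken from the source.

```python
import numpy as np


def apply_mask(spec, mask):
    """Multiply the 128 lowest bins by the mask and zero the 129th bin.

    spec: complex array of shape (129, T), one channel's STFT.
    mask: real or complex array of shape (128, T) from the network.
    """
    out = np.zeros_like(spec)
    out[:128] = spec[:128] * mask  # element-wise product on the low bins
    # out[128] stays 0, matching "the 129th frequency bin is filled with 0".
    return out
```

The result has the full 129-bin shape again, so it can be passed directly to an ISTFT that uses the same window and hop as the forward transform.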
- The input has the shape (batch, frequency, time step, channel).
- Each convolutional layer is specified as (filter height, filter width), (stride height, stride width), (padding height, padding width).
- The output has the shape (batch, frequency, time step, channel). All activation functions are PReLU, except in the last layer.
- The activation function of the last layer is a sigmoid function, and channels 1 and 2 are used as the real and imaginary parts of the mask, respectively.
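As a sketch of how a two-channel output can be turned into a complex mask, the snippet below applies a sigmoid to an array of shape (batch, 128, T, 2) and combines the two channels. The function names and the use of NumPy are assumptions for illustration; the source does not name a framework.

```python
import numpy as np


def sigmoid(z):
    """Logistic function, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def complex_mask(logits):
    """Build a complex mask from last-layer pre-activations.

    logits: real array of shape (batch, 128, T, 2).
    Channel 1 (index 0) becomes the real part and channel 2 (index 1)
    the imaginary part, as described above.
    """
    act = sigmoid(logits)
    return act[..., 0] + 1j * act[..., 1]
```

Because a sigmoid output lies strictly between 0 and 1, both the real and the imaginary part of a mask built this way fall in that range.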
- FIG. 6 is a diagram for explaining a beamforming system according to an embodiment of the present invention.
- FIG. 7 is a diagram for explaining a beamforming system according to an embodiment of the present invention.
- FIG. 8 is a diagram for explaining a beamforming system according to an embodiment of the present invention.
- The beamforming system 2 may be implemented as a monitor including a first microphone M1 and a second microphone M2.
- The beamforming system 2 may be implemented to include a cradle-type device 14 on which a portable computing device 22, such as a smartphone, can be mounted. After beamforming is performed on the sound signals received using the microphones M1 and M2, the resulting beamformed sound signal may be provided to the portable computing device 22 through any connection means. The portable computing device 22 may then provide the discerned human voice, e.g., to other videoconference participants.
- The beamforming system 4 may be implemented as an attachable device 16 that can be attached to a portable computing device 22, such as a smartphone. After beamforming is performed on the sound signals received using the microphones M1 and M2, the resulting beamformed sound signal may be provided to the portable computing device 22 through any connection means. The portable computing device 22 may then provide the discerned human voice, e.g., to other videoconference participants.
- FIGS. 9 and 10 are diagrams for explaining advantageous effects of a beamforming method and a beamforming system according to embodiments of the present invention.
- FIG. 9 shows the loudness (upper row) and short-time objective intelligibility (STOI) score (lower row) at each angle for beamforming using the classical method (MVDR), and FIG. 10 shows the loudness (upper row) and STOI score (lower row) at each angle for beamforming using a neural network according to various embodiments.
- STOI is an index of how well a sound is restored, regardless of its loudness.
- With the classical method, the STOI is high even in directions other than the desired direction, whereas with the neural network method according to various embodiments of the present invention it is low in those directions. This shows that the neural network method according to various embodiments of the present invention more reliably separates out voices spoken from non-desired directions and is less affected by interference.
Table 1
| Name | Input (B, F, T, C) | Layer (filter), (stride), (padding) | Output (B, F, T, C) |
|---|---|---|---|
| Conv1 | B,128,T,1 | (5,3),(1,1),(2,0) | B,128,T,5 |
| Conv2 | B,128,T,5 | (4,1),(2,1),(1,0) | B,64,T,5 |
| Conv3 | B,64,T,5 | (5,3),(1,1),(2,0) | B,64,T,10 |
| Conv4 | B,64,T,10 | (4,1),(2,1),(1,0) | B,32,T,10 |
| Conv5 | B,32,T,10 | (5,3),(1,1),(2,0) | B,32,T,18 |
| Conv6 | B,32,T,18 | (4,1),(2,1),(1,0) | B,16,T,18 |
| Conv7 | B,16,T,18 | (5,3),(1,1),(2,0) | B,16,T,32 |
| Conv8 | B,16,T,32 | (4,1),(2,1),(1,0) | B,8,T,32 |
| Conv9 | B,8,T,32 | (5,3),(1,1),(2,0) | B,8,T,32 |
| deConv1 | B,8,T,32 | (4,1),(2,1),(1,0) | B,16,T,64 |
| Conv10 | B,16,T,64 | (5,3),(1,1),(2,0) | B,16,T,18 |
| deConv2 | B,16,T,18 | (4,1),(2,1),(1,0) | B,32,T,36 |
| Conv11 | B,32,T,36 | (5,3),(1,1),(2,0) | B,32,T,10 |
| deConv3 | B,32,T,10 | (4,1),(2,1),(1,0) | B,64,T,20 |
| Conv12 | B,64,T,20 | (5,3),(1,1),(2,0) | B,64,T,5 |
| deConv4 | B,64,T,5 | (4,1),(2,1),(1,0) | B,128,T,10 |
| Conv13 | B,128,T,10 | (5,3),(1,1),(2,0) | B,128,T,5 |
| Conv14 | B,128,T,5 | (1,1),(1,1),(0,0) | B,128,T,2 |
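The frequency sizes in the table can be reproduced with the usual output-length formulas for strided and transposed convolutions. The sketch below assumes those standard formulas (no dilation, no output padding) and covers only the frequency axis; the source does not state which framework or padding convention was used, and the time axis and channel counts are not modeled here.

```python
def conv_out(size, kernel, stride, pad):
    """Output length of a strided convolution along one axis."""
    return (size + 2 * pad - kernel) // stride + 1


def deconv_out(size, kernel, stride, pad):
    """Output length of a transposed convolution along one axis."""
    return (size - 1) * stride - 2 * pad + kernel


# (kernel, stride, padding) along the frequency axis, read from the table.
SAME = (5, 1, 2)   # layers listed as (5,3),(1,1),(2,0)
HALF = (4, 2, 1)   # layers listed as (4,1),(2,1),(1,0)
POINT = (1, 1, 0)  # Conv14, listed as (1,1),(1,1),(0,0)

LAYERS = [
    ("Conv1", "conv", SAME), ("Conv2", "conv", HALF),
    ("Conv3", "conv", SAME), ("Conv4", "conv", HALF),
    ("Conv5", "conv", SAME), ("Conv6", "conv", HALF),
    ("Conv7", "conv", SAME), ("Conv8", "conv", HALF),
    ("Conv9", "conv", SAME), ("deConv1", "deconv", HALF),
    ("Conv10", "conv", SAME), ("deConv2", "deconv", HALF),
    ("Conv11", "conv", SAME), ("deConv3", "deconv", HALF),
    ("Conv12", "conv", SAME), ("deConv4", "deconv", HALF),
    ("Conv13", "conv", SAME), ("Conv14", "conv", POINT),
]


def frequency_sizes(start=128):
    """Map each layer name to its output size along the frequency axis."""
    sizes = {}
    size = start
    for name, kind, (k, s, p) in LAYERS:
        fn = conv_out if kind == "conv" else deconv_out
        size = fn(size, k, s, p)
        sizes[name] = size
    return sizes
```

Under these assumptions the stride-2 layers halve the frequency axis from 128 down to 8, and the four transposed convolutions double it back to 128, matching the Output column.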
Claims (10)
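- A beamforming method comprising: receiving a first sound signal and a second sound signal using, respectively, a first microphone and a second microphone disposed at a predetermined distance from the first microphone; obtaining a Fourier transform result for each of the first sound signal and the second sound signal; obtaining a phase difference between the first sound signal and the second sound signal from the Fourier transform results; inputting the phase difference into a beamforming model and computing on it using a neural processor; performing an element-wise product on the computation result of the neural processor and the Fourier transform result for the first sound signal; and outputting the result of the element-wise product.
- The beamforming method of claim 1, wherein performing the element-wise product further comprises applying a mask to the computation result before performing the element-wise product.
- The beamforming method of claim 1, wherein performing the element-wise product further comprises performing gain control after performing the element-wise product.
- The beamforming method of claim 1, wherein the predetermined distance is 10 cm to 14 cm.
- The beamforming method of claim 1, further comprising training the beamforming model using the phase difference.
- A beamforming system comprising: a first microphone that receives a first sound signal; a second microphone that is disposed at a predetermined distance from the first microphone and receives a second sound signal; a first STFT module that obtains a Fourier transform result for the first sound signal; a second STFT module that obtains a Fourier transform result for the second sound signal; a phase difference acquisition module that obtains a phase difference between the first sound signal and the second sound signal from the Fourier transform results; a neural processor that receives the phase difference and performs a neural network operation using a beamforming model; an element-wise product module that performs an element-wise product on the computation result of the neural processor and the Fourier transform result for the first sound signal; and an output module that outputs the result of the element-wise product.
- The beamforming system of claim 6, further comprising a masking module that applies a mask to the computation result before the element-wise product is performed.
- The beamforming system of claim 6, further comprising a gain control module that performs gain control after the element-wise product is performed.
- The beamforming system of claim 6, wherein the predetermined distance is 10 cm to 14 cm.
- The beamforming system of claim 6, further comprising a learning model that trains the beamforming model using the phase difference.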
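The steps of the method claim can be sketched end to end as follows. This is an illustrative outline only: `model` is a hypothetical callable standing in for the trained beamforming model, and the spectrograms are assumed to be complex NumPy arrays of identical shape, one per microphone.

```python
import numpy as np


def phase_difference(spec1, spec2):
    """Inter-channel phase difference in radians, wrapped to (-pi, pi]."""
    return np.angle(spec1 * np.conj(spec2))


def beamform(spec1, spec2, model):
    """Phase difference -> model -> element-wise product with channel 1.

    spec1, spec2: complex STFTs of the first and second microphone signals.
    model: hypothetical callable mapping the phase difference to a mask
           with the same shape as spec1.
    """
    ipd = phase_difference(spec1, spec2)
    mask = model(ipd)
    return mask * spec1  # element-wise product with the first channel's STFT
```

For example, passing `model=lambda ipd: np.ones_like(ipd)` returns the first channel unchanged, which is a convenient sanity check before plugging in a trained network.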
Priority Applications (3)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| JP2023551942A JP7591848B2 (ja) | 2020-11-04 | 2021-09-29 | ニューラルネットワークを用いたビームフォーミング方法及びビームフォーミングシステム |
| US18/035,297 US12477273B2 (en) | 2020-11-04 | 2021-09-29 | Beamforming method and beamforming system using neural network |
| EP21889384.0A EP4258567A4 (en) | 2020-11-04 | 2021-09-29 | BEAM FORMING METHOD AND BEAM FORMING SYSTEM WITH NEURAL NETWORK |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| KR10-2020-0146191 | 2020-11-04 | ||
| KR1020200146191A KR102412148B1 (ko) | 2020-11-04 | 2020-11-04 | 뉴럴 네트워크를 이용한 빔포밍 방법 및 빔포밍 시스템 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2022097919A1 (ko) | 2022-05-12 |
Family
ID=81457019
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/KR2021/013328 Ceased WO2022097919A1 (ko) | 2020-11-04 | 2021-09-29 | 뉴럴 네트워크를 이용한 빔포밍 방법 및 빔포밍 시스템 |
Country Status (5)
| Country | Link |
|---|---|
| US (1) | US12477273B2 (ko) |
| EP (1) | EP4258567A4 (ko) |
| JP (1) | JP7591848B2 (ko) |
| KR (1) | KR102412148B1 (ko) |
| WO (1) | WO2022097919A1 (ko) |
Families Citing this family (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US12335002B2 (en) * | 2022-02-11 | 2025-06-17 | Qualcomm Incorporated | Calibration application for mitigating millimeter wave signal blockage |
| KR102869018B1 (ko) * | 2022-09-14 | 2025-10-14 | (주) 오토노머스에이투지 | 머신러닝에 기반하여 주변 차량 정보를 생성함으로써 자율 주행을 지원하기 위한 학습 방법 및 학습 장치, 이를 이용한 테스트 방법 및 테스트 장치 |
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| KR20180111271A (ko) * | 2017-03-31 | 2018-10-11 | 삼성전자주식회사 | 신경망 모델을 이용하여 노이즈를 제거하는 방법 및 장치 |
| KR20180115984A (ko) * | 2017-04-14 | 2018-10-24 | 한양대학교 산학협력단 | 심화신경망 기반의 잡음 및 에코의 통합 제거 방법 및 장치 |
| WO2019199554A1 (en) * | 2018-04-11 | 2019-10-17 | Microsoft Technology Licensing, Llc | Multi-microphone speech separation |
| US20200111483A1 (en) * | 2016-12-21 | 2020-04-09 | Google Llc | Complex evolution recurrent neural networks |
| US20200342891A1 (en) * | 2019-04-26 | 2020-10-29 | Beijing Didi Infinity Technology And Development Co., Ltd. | Systems and methods for aduio signal processing using spectral-spatial mask estimation |
Family Cites Families (12)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US8062918B2 (en) * | 2008-05-01 | 2011-11-22 | Intermolecular, Inc. | Surface treatment to improve resistive-switching characteristics |
| US9516417B2 (en) * | 2013-01-02 | 2016-12-06 | Microsoft Technology Licensing, Llc | Boundary binaural microphone array |
| US9460732B2 (en) * | 2013-02-13 | 2016-10-04 | Analog Devices, Inc. | Signal source separation |
| US9881631B2 (en) * | 2014-10-21 | 2018-01-30 | Mitsubishi Electric Research Laboratories, Inc. | Method for enhancing audio signal using phase information |
| US11133011B2 (en) | 2017-03-13 | 2021-09-28 | Mitsubishi Electric Research Laboratories, Inc. | System and method for multichannel end-to-end speech recognition |
| EP3649642A1 (en) * | 2017-07-03 | 2020-05-13 | Yissum Research Development Company of The Hebrew University of Jerusalem Ltd. | Method and system for enhancing a speech signal of a human speaker in a video using visual information |
| US10522167B1 (en) | 2018-02-13 | 2019-12-31 | Amazon Techonlogies, Inc. | Multichannel noise cancellation using deep neural network masking |
| US10573301B2 (en) * | 2018-05-18 | 2020-02-25 | Intel Corporation | Neural network based time-frequency mask estimation and beamforming for speech pre-processing |
| JP6903611B2 (ja) | 2018-08-27 | 2021-07-14 | 株式会社東芝 | 信号生成装置、信号生成システム、信号生成方法およびプログラム |
| US10726830B1 (en) * | 2018-09-27 | 2020-07-28 | Amazon Technologies, Inc. | Deep multi-channel acoustic modeling |
| US11435429B2 (en) * | 2019-03-20 | 2022-09-06 | Intel Corporation | Method and system of acoustic angle of arrival detection |
| EP4042415B1 (en) * | 2019-10-11 | 2026-01-28 | Pindrop Security, Inc. | Z-vectors: speaker embeddings from raw audio using sincnet, extended cnn architecture, and in-network augmentation techniques |
2020
- 2020-11-04 KR KR1020200146191A patent/KR102412148B1/ko active Active
2021
- 2021-09-29 JP JP2023551942A patent/JP7591848B2/ja active Active
- 2021-09-29 US US18/035,297 patent/US12477273B2/en active Active
- 2021-09-29 EP EP21889384.0A patent/EP4258567A4/en active Pending
- 2021-09-29 WO PCT/KR2021/013328 patent/WO2022097919A1/ko not_active Ceased
Non-Patent Citations (1)
| Title |
|---|
| See also references of EP4258567A4 * |
Also Published As
| Publication number | Publication date |
|---|---|
| EP4258567A1 (en) | 2023-10-11 |
| JP2024508821A (ja) | 2024-02-28 |
| KR102412148B1 (ko) | 2022-06-22 |
| US20230269532A1 (en) | 2023-08-24 |
| KR20220060322A (ko) | 2022-05-11 |
| EP4258567A4 (en) | 2024-12-04 |
| US12477273B2 (en) | 2025-11-18 |
| JP7591848B2 (ja) | 2024-11-29 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 21889384; Country of ref document: EP; Kind code of ref document: A1 |
| | WWE | Wipo information: entry into national phase | Ref document number: 2023551942; Country of ref document: JP |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | ENP | Entry into the national phase | Ref document number: 2021889384; Country of ref document: EP; Effective date: 20230605 |
| | WWG | Wipo information: grant in national office | Ref document number: 18035297; Country of ref document: US |
