WO2022143119A1 - Sound collection method, electronic device, and system - Google Patents
Sound collection method, electronic device, and system
- Publication number
- WO2022143119A1 (PCT/CN2021/137406)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- electronic device
- audio signal
- microphone
- signal
- coordinate system
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; ELECTRIC HEARING AIDS; PUBLIC ADDRESS SYSTEMS
- H04R1/00—Details of transducers, loudspeakers or microphones
- H04R1/08—Mouthpieces; Microphones; Attachments therefor
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; ELECTRIC HEARING AIDS; PUBLIC ADDRESS SYSTEMS
- H04R3/00—Circuits for transducers
- H04R3/005—Circuits for transducers for combining the signals of two or more microphones
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; ELECTRIC HEARING AIDS; PUBLIC ADDRESS SYSTEMS
- H04R3/00—Circuits for transducers
- H04R3/04—Circuits for transducers for correcting frequency response
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/161—Detection; Localisation; Normalisation
- G06V40/166—Detection; Localisation; Normalisation using acquisition arrangements
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; ELECTRIC HEARING AIDS; PUBLIC ADDRESS SYSTEMS
- H04R1/00—Details of transducers, loudspeakers or microphones
- H04R1/10—Earpieces; Attachments therefor ; Earphones; Monophonic headphones
- H04R1/1091—Details not provided for in groups H04R1/1008 - H04R1/1083
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04S—STEREOPHONIC SYSTEMS
- H04S7/00—Indicating arrangements; Control arrangements, e.g. balance control
- H04S7/30—Control circuits for electronic adaptation of the sound field
- H04S7/302—Electronic adaptation of stereophonic sound system to listener position or orientation
- H04S7/303—Tracking of listener position or orientation
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; ELECTRIC HEARING AIDS; PUBLIC ADDRESS SYSTEMS
- H04R2201/00—Details of transducers, loudspeakers or microphones covered by H04R1/00 but not provided for in any of its subgroups
- H04R2201/10—Details of earpieces, attachments therefor, earphones or monophonic headphones covered by H04R1/10 but not provided for in any of its subgroups
- H04R2201/107—Monophonic and stereophonic headphones with microphone for two-way hands free communication
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; ELECTRIC HEARING AIDS; PUBLIC ADDRESS SYSTEMS
- H04R2410/00—Microphones
- H04R2410/03—Reduction of intrinsic noise in microphones
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; ELECTRIC HEARING AIDS; PUBLIC ADDRESS SYSTEMS
- H04R5/00—Stereophonic arrangements
- H04R5/033—Headphones for stereophonic communication
Definitions
- the present application relates to the field of terminal technologies, and in particular, to a sound collection method, electronic device and system.
- true wireless stereo (TWS) headphones can exchange audio signals with electronic devices such as mobile phones and tablet computers more efficiently and with high quality, and are favored by more and more users. Users can collect sound through the microphone of the TWS headset during live broadcast, video blog (vlog) and other video recording scenarios.
- the sound pickup and noise reduction scheme adopted by the existing TWS headset is limited by the number of microphones in the headset and the user's wearing posture. TWS earphones are less effective at suppressing ambient noise and enhancing target sound.
- the present application provides a sound collection method, electronic device and system.
- in a scenario where a user wears a wireless headset and uses the first electronic device to record a video, the microphones of both the wireless headset and the first electronic device can be used for sound collection.
- the first electronic device can perform noise reduction processing on the audio signals obtained through the plurality of microphones, so as to improve the sound quality in the recorded video.
- the present application provides a sound collection method.
- the method can be applied to a first electronic device with a first microphone, a left-ear wireless headset with a second microphone, and a right-ear wireless headset with a third microphone.
- the first electronic device is connected to the left ear wireless earphone and the right ear wireless earphone through wireless communication.
- the method may include: the first electronic device may acquire a face image.
- the first electronic device may determine relative positions of the first microphone, the second microphone, and the third microphone based on the face image and the posture information of the first electronic device.
- the first electronic device may acquire the first audio signal of the first microphone, the second audio signal of the second microphone, and the third audio signal of the third microphone.
- the first electronic device may perform noise reduction processing on the first audio signal, the second audio signal and the third audio signal based on the relative positions.
- the above-mentioned first microphone, second microphone and third microphone may form a first microphone array.
- the first microphone may comprise one or more microphones in the first electronic device.
- the second microphone may comprise one or more microphones in the left ear wireless headset.
- the third microphone may comprise one or more microphones in the right ear wireless headset.
- the near-field area formed by the first microphone array includes the area where the user wearing the TWS headset and the first electronic device 100 are located.
- the size of the first microphone array is larger and the spatial resolution capability is stronger, which can more accurately distinguish the target sound in the near-field area and the ambient noise from the far-field area.
- the target sound can be better enhanced, the environmental noise can be suppressed, and the sound quality of the recorded video can be improved.
- adding a microphone of the first electronic device 100 to the microphone array can reduce the influence of the way the user wears the TWS headset on the ability of the first electronic device 100 to enhance the target sound and reduce environmental noise.
- the above-mentioned target sound may be a sound whose sound source is located in the near-field area of the first microphone array.
- the above-mentioned target sound may include the user's voice and the sound of the user playing a musical instrument.
- before the first electronic device performs noise reduction processing on the first audio signal, the second audio signal, and the third audio signal based on the above-mentioned relative positions, the first electronic device may perform time delay alignment on the first audio signal, the second audio signal, and the third audio signal.
- the first electronic device may emit an alignment sound.
- the alignment sound is obtained by digital-to-analog conversion of the alignment signal.
- the first electronic device may perform time delay correlation detection on the first alignment signal portion in the first audio signal, the second alignment signal portion in the second audio signal, and the third alignment signal portion in the third audio signal, and determine the delay lengths between the first audio signal, the second audio signal, and the third audio signal.
- the first electronic device may perform time delay alignment on the first audio signal, the second audio signal, and the third audio signal based on the delay lengths.
- the above-mentioned alignment signal may be an audio signal with a frequency higher than 20000 Hz.
- the audible frequency range of the human ear is between 20Hz and 20000Hz.
- because the frequency of the alignment signal is higher than 20000 Hz, the alignment sound will not be heard by the user. This prevents the alignment sound from disturbing the user.
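As a rough illustration of the alignment signal described above, the following sketch generates a short sine tone above the audible band. The 21000 Hz tone frequency, 48000 Hz sample rate, duration, amplitude, and the function name `make_alignment_tone` are all assumptions made for this example; the source only states that the frequency is higher than 20000 Hz.

```python
import math


def make_alignment_tone(freq_hz=21000.0, sample_rate_hz=48000,
                        duration_s=0.01, amplitude=0.1):
    """Return samples of a sine tone meant to sit above the audible band.

    All default values are illustrative assumptions, not values from the source.
    """
    if freq_hz <= 20000.0:
        raise ValueError("alignment tone should be above 20000 Hz to stay inaudible")
    if freq_hz >= sample_rate_hz / 2:
        raise ValueError("tone must be below the Nyquist frequency to be representable")
    n = int(round(duration_s * sample_rate_hz))
    return [amplitude * math.sin(2.0 * math.pi * freq_hz * i / sample_rate_hz)
            for i in range(n)]


tone = make_alignment_tone()
```

The Nyquist check reflects a practical constraint: a tone above 20000 Hz can only be represented, played, and recorded if the sample rate is more than twice the tone frequency.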
- the above-mentioned time delay alignment can ensure that the aligned first audio signal, second audio signal and third audio signal correspond to data collected at the same time, thereby reducing the influence of the time delay error on the noise reduction processing performed by the first electronic device.
- the first electronic device may use any one of the first audio signal, the second audio signal, and the third audio signal as the reference audio signal, and perform time delay alignment of the other audio signals against the reference audio signal.
- the first electronic device may use the first audio signal as the reference audio signal.
- the first electronic device may perform high-pass filtering on the first audio signal, the second audio signal and the third audio signal to obtain an alignment signal portion in these audio signals.
- the first electronic device may apply delays of different lengths to the alignment signal portions in the second audio signal and the third audio signal, to determine the delay length at which each of these alignment signal portions has the highest correlation with the alignment signal portion in the first audio signal. In this way, the first electronic device can determine the delay lengths of the second audio signal and the third audio signal relative to the first audio signal.
- the first electronic device may generate the above-mentioned alignment signal during a preset time period around the start of recording (for example, a preset time period before starting recording or a preset time period after starting recording), and emit the alignment sound.
- the first electronic device may determine the delay lengths between the first audio signal, the second audio signal, and the third audio signal according to the alignment signal of the preset time period.
- the above-mentioned delay lengths generally do not change or change very little during the recording process of the first electronic device.
- the first electronic device may perform time delay alignment on the first audio signal, the second audio signal and the third audio signal according to the above delay lengths. The first electronic device may then stop generating the above-mentioned alignment signal, so as to save power consumption of the first electronic device.
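The correlation search described above can be sketched as follows. This is a minimal illustration, not the method of the application: it assumes the alignment signal portions have already been extracted by high-pass filtering, that the earphone streams lag the reference by a non-negative whole number of samples, and it uses a random burst as a stand-in for the alignment signal. The names `estimate_delay` and `align` are hypothetical.

```python
import random


def estimate_delay(reference, delayed, max_lag):
    """Return the lag (in samples) at which `delayed` best matches `reference`."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(max_lag + 1):
        # Correlate reference[i] with delayed[i + lag] over the overlapping part.
        n = min(len(reference), len(delayed) - lag)
        score = sum(reference[i] * delayed[i + lag] for i in range(n))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def align(signal, lag):
    """Drop the first `lag` samples so the signal lines up with the reference."""
    return signal[lag:]


rng = random.Random(0)
# Stand-in for the alignment signal portion after high-pass filtering.
burst = [rng.uniform(-1.0, 1.0) for _ in range(200)]
phone = burst + [0.0] * 50                 # reference (first audio signal)
left = [0.0] * 7 + burst + [0.0] * 43      # second audio signal, 7 samples late
right = [0.0] * 12 + burst + [0.0] * 38    # third audio signal, 12 samples late

lag_left = estimate_delay(phone, left, max_lag=40)
lag_right = estimate_delay(phone, right, max_lag=40)
left_aligned = align(left, lag_left)
right_aligned = align(right, lag_right)
```

Because the delay is described as nearly constant during recording, the lags would only need to be estimated once per recording in this sketch and then reused for every subsequent frame.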
- the relative positions of the first microphone, the second microphone, and the third microphone may include coordinates of the first microphone, the second microphone, and the third microphone in the world coordinate system.
- the first electronic device may, based on the correspondence between the coordinates of the first face key points in the standard human head coordinate system and their coordinates in the face image coordinate system, determine the first conversion relationship between the standard human head coordinate system and the first electronic device coordinate system.
- the standard human head coordinate system can be determined according to the standard human head model.
- the first electronic device may store the coordinates of each key point in the standard human head model in the standard human head coordinate system.
- the first electronic device may determine the coordinates of the left and right ears in the standard human head model in the first electronic device coordinate system based on the first conversion relationship and the coordinates of the left and right ears in the standard human head model in the standard human head coordinate system.
- the coordinates of the left ear and the right ear in the coordinate system of the first electronic device in the standard human head model are the coordinates of the second microphone and the third microphone in the coordinate system of the first electronic device, respectively.
- the first electronic device may determine the second conversion relationship between the coordinate system of the first electronic device and the world coordinate system according to the attitude information of the first electronic device.
- the first electronic device may determine, based on the second conversion relationship and the coordinates of the first microphone, the second microphone, and the third microphone in the coordinate system of the first electronic device, the coordinates of the first microphone, the second microphone, and the third microphone in the world coordinate system (that is, the above-mentioned relative positions).
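The chain of coordinate conversions described above (standard human head coordinate system to first electronic device coordinate system, then to world coordinate system) can be sketched with rigid transforms. Everything numeric here is hypothetical: the ear and microphone coordinates, the two rotations, and the translation are made-up values, and the second conversion is assumed to be a pure rotation. Estimating the first conversion from face key point correspondences is not shown.

```python
import math


def mat_vec(m, v):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(m[r][c] * v[c] for c in range(3)) for r in range(3)]


def transform(rotation, translation, point):
    """Apply a rigid transform: rotate the point, then translate it."""
    rotated = mat_vec(rotation, point)
    return [rotated[i] + translation[i] for i in range(3)]


def rot_z(angle_rad):
    """Rotation matrix about the z-axis."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


# Hypothetical ear positions in the standard human head coordinate system (metres).
left_ear_head = [-0.09, 0.075, 0.0]
right_ear_head = [-0.09, -0.075, 0.0]
# Hypothetical position of the phone microphone in the device coordinate system.
phone_mic_device = [0.0, 0.0, -0.07]

# First conversion (head -> device); in practice estimated from the face image.
head_to_device_rot, head_to_device_shift = rot_z(math.pi), [0.5, 0.0, 0.0]
# Second conversion (device -> world); in practice derived from attitude sensors.
device_to_world_rot, device_to_world_shift = rot_z(math.pi / 2), [0.0, 0.0, 0.0]

left_mic_device = transform(head_to_device_rot, head_to_device_shift, left_ear_head)
right_mic_device = transform(head_to_device_rot, head_to_device_shift, right_ear_head)

left_mic_world = transform(device_to_world_rot, device_to_world_shift, left_mic_device)
right_mic_world = transform(device_to_world_rot, device_to_world_shift, right_mic_device)
phone_mic_world = transform(device_to_world_rot, device_to_world_shift, phone_mic_device)
```

The ear coordinates stand in for the earphone microphone positions, as the text does when it treats the left and right ears of the standard human head model as the positions of the second and third microphones.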
- the above-mentioned standard human head coordinate system can be a three-dimensional coordinate system established with the position of the nose tip in the standard human head model as the origin, the direction perpendicular to the face as the direction of the x-axis, the horizontal direction parallel to the face as the direction of the y-axis, and the vertical direction parallel to the face as the direction of the z-axis.
- the above-mentioned first face key points may include any number of key points in the region where the face is located in the face image, for example, key points in the forehead area, key points in the cheek area, key points in the lip area, etc.
- the above-mentioned attitude information may be determined by the first electronic device through an attitude sensor (e.g., an acceleration sensor or a gyroscope sensor).
- the pose of the first electronic device is unknown and time-varying.
- the target sound to be collected by the first microphone array not only includes the user's voice, but also includes the user's sound of playing a musical instrument.
- using the coordinates of each microphone in the first microphone array in the world coordinate system during spatial filtering by the first electronic device can better enhance the target sound and reduce ambient noise.
- the noise reduction processing performed by the first electronic device on the first audio signal, the second audio signal and the third audio signal based on the relative positions may specifically include: the first electronic device may perform voice activity detection on the first audio signal, the second audio signal and the third audio signal based on the relative positions.
- the voice activity detection may be used to determine the frequency points of the target sound signal and the frequency points of the ambient noise signal in the first audio signal, the second audio signal and the third audio signal.
- the first electronic device may update the noise space characteristic of the ambient noise based on the frequency point of the target sound signal and the frequency point of the ambient noise signal.
- Noise spatial characteristics can be used to indicate the spatial distribution of ambient noise.
- the spatial distribution of ambient noise includes the direction and energy of ambient noise.
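One common way to represent a noise spatial characteristic is a per-frequency noise covariance matrix that is updated only at frequency points classified as ambient noise. The sketch below assumes that representation and a simple recursive average; neither the covariance form, the smoothing factor, nor the function name `update_noise_covariance` comes from the source.

```python
def update_noise_covariance(cov, frame, is_noise, smoothing=0.9):
    """Recursively average the outer product x x^H for one frequency point.

    cov      -- current M x M covariance estimate (list of lists of complex)
    frame    -- the M microphone spectra values at this frequency point
    is_noise -- voice activity detection result for this frequency point
    """
    if not is_noise:
        # Frequency point dominated by the target sound: keep the old estimate.
        return cov
    m = len(frame)
    return [[smoothing * cov[r][c]
             + (1.0 - smoothing) * frame[r] * frame[c].conjugate()
             for c in range(m)] for r in range(m)]


cov = [[0j] * 3 for _ in range(3)]
cov = update_noise_covariance(cov, [1 + 0j, 1j, -1 + 0j],
                              is_noise=True, smoothing=0.5)
```

The diagonal of such a matrix carries the noise energy at each microphone and the off-diagonal phases carry directional information, which matches the description of the spatial distribution as including the direction and energy of ambient noise.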
- the first electronic device may determine target steering vectors for the first audio signal, the second audio signal, and the third audio signal based on the relative positions.
- the target steering vector can be used to indicate the direction of the target sound signal.
- the first electronic device determines a spatial filter based on the noise spatial characteristic and the target steering vector, and uses the spatial filter to perform spatial filtering on the first audio signal, the second audio signal, and the third audio signal, so as to enhance the target sound signal in the first audio signal, the second audio signal, and the third audio signal and suppress the ambient noise signal therein.
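A spatial filter built from a noise covariance matrix and a target steering vector can be sketched with the well-known MVDR (minimum variance distortionless response) beamformer, w = R⁻¹d / (dᴴR⁻¹d). The source does not name a specific filter, so MVDR is an assumption here, as are the near-field steering model, the speed of sound, the microphone and mouth coordinates, and the example covariance matrix.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at room temperature


def steering_vector(mic_positions, source_position, freq_hz):
    """Near-field steering vector: per-microphone phase delay and 1/r decay."""
    vec = []
    for mic in mic_positions:
        r = math.dist(mic, source_position)
        vec.append(cmath.exp(-2j * math.pi * freq_hz * r / SPEED_OF_SOUND) / r)
    return vec


def solve(a, b):
    """Solve a x = b for a small complex system (Gauss-Jordan with pivoting)."""
    n = len(b)
    aug = [list(a[r]) + [b[r]] for r in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[r][n] / aug[r][r] for r in range(n)]


def mvdr_weights(noise_cov, steer):
    """MVDR weights w = R^-1 d / (d^H R^-1 d) for one frequency point."""
    rinv_d = solve(noise_cov, steer)
    denom = sum(d.conjugate() * x for d, x in zip(steer, rinv_d))
    return [x / denom for x in rinv_d]


def apply_filter(weights, frame):
    """Beamformer output y = w^H x for one frequency point."""
    return sum(w.conjugate() * x for w, x in zip(weights, frame))


# Hypothetical world coordinates (metres): phone mic, left earbud, right earbud.
mics = [[0.0, 0.0, -0.07], [0.075, 0.59, 0.0], [-0.075, 0.59, 0.0]]
mouth = [0.0, 0.5, 0.0]  # hypothetical target sound source position
steer = steering_vector(mics, mouth, freq_hz=1000.0)

# Hypothetical Hermitian noise covariance: identity plus one directional term.
noise_cov = [[2 + 0j, -1j, -1 + 0j],
             [1j, 2 + 0j, -1j],
             [-1 + 0j, 1j, 2 + 0j]]

weights = mvdr_weights(noise_cov, steer)
target_response = apply_filter(weights, steer)
```

With these weights the response toward the assumed target position is constrained to one, while the output power contributed by the modelled noise is minimised, which corresponds to enhancing the target sound signal and suppressing the ambient noise signal.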
- the left ear wireless earphone can perform wearing detection.
- the above-mentioned wearing detection can be used to determine whether the left ear wireless earphone is in the in-ear state.
- when the left-ear wireless earphone is in the in-ear state, the left-ear wireless earphone obtains the second audio signal by using the second microphone.
- the right ear wireless headset can perform wearing detection.
- when the right-ear wireless earphone is in the in-ear state, the right-ear wireless earphone obtains the third audio signal by using the third microphone.
- the first electronic device may inquire about the wearing detection result from the left ear wireless earphone and the right ear wireless earphone. When it is determined that both the left-ear wireless earphone and the right-ear wireless earphone are in the in-ear state, the first electronic device may turn on the first microphone, and send an instruction to turn on the microphone to the left-ear wireless earphone and the right-ear wireless earphone. When receiving an instruction to turn on the microphone from the first electronic device, the left-ear wireless earphone and the right-ear wireless earphone can turn on the second microphone and the third microphone, respectively.
- the left-ear wireless headset may turn on the second microphone when the wearing detection is performed and it is determined that the left-ear wireless headset is in the in-ear state.
- the right-ear wireless earphone can turn on the third microphone when the wearing detection is performed and it is determined that the right-ear wireless earphone is in the in-ear state. That is, the first electronic device may not need to send an instruction to turn on the microphone to the left-ear wireless earphone and the right-ear wireless earphone.
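The first enabling flow described above, in which the first electronic device queries the wearing detection results and then instructs both earphones to turn on their microphones, can be sketched as a small piece of control logic. The `Earbud` class and its method names are hypothetical stand-ins; no real earphone API is implied.

```python
from dataclasses import dataclass


@dataclass
class Earbud:
    """Hypothetical stand-in for one wireless earphone."""
    in_ear: bool
    mic_on: bool = False

    def wearing_detection(self) -> bool:
        # A real earphone would use a sensor; here the state is given directly.
        return self.in_ear

    def turn_on_microphone(self) -> None:
        self.mic_on = True


def enable_microphone_array(left: Earbud, right: Earbud) -> bool:
    """Return True if the phone and both earphone microphones are turned on."""
    if not (left.wearing_detection() and right.wearing_detection()):
        return False  # at least one earphone is not in the in-ear state
    left.turn_on_microphone()
    right.turn_on_microphone()
    return True


left, right = Earbud(in_ear=True), Earbud(in_ear=True)
array_enabled = enable_microphone_array(left, right)
```

In the alternative flow, each earphone would call its own `turn_on_microphone` as soon as its wearing detection succeeds, without waiting for an instruction from the first electronic device.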
- the first electronic device may mix the first video captured during the first time period with the fourth audio signal.
- the fourth audio signal may be the audio signal after noise reduction processing of the first audio signal, the second audio signal and the third audio signal.
- the first audio signal, the second audio signal, and the third audio signal may be obtained through the first microphone, the second microphone, and the third microphone in the first time period, respectively.
- the microphone of one of the left ear wireless earphone and the right ear wireless earphone may form a microphone array with the microphone of the first electronic device.
- the microphone array can also be applied to the sound collection method provided in the embodiment of the present application.
- the first microphone of the first electronic device and the second microphone of the left-ear wireless headset can perform sound collection.
- the first electronic device can obtain the first audio signal through the first microphone.
- the left-ear wireless headset can obtain the second audio signal through the second microphone.
- the left ear wireless earphone can send the second audio signal to the first electronic device.
- the first electronic device may perform noise reduction processing on the first audio signal and the second audio signal to obtain the audio signal in the live video.
- the microphone of the above-mentioned one earphone may form a microphone array with the microphone of the first electronic device.
- the near-field area of the microphone array may still include the location of the sound source of the target sound, such as the user's voice and the sound of the user playing a musical instrument. Also, using the microphone of only one earphone can reduce the power consumption of the headset.
- the present application further provides a sound collection method.
- the method is applied to a first electronic device having a first microphone.
- the first electronic device is connected to the left-ear wireless earphone with the second microphone and the right-ear wireless earphone with the third microphone through wireless communication.
- the method may include: acquiring a face image by the first electronic device.
- the first electronic device determines relative positions of the first microphone, the second microphone, and the third microphone based on the face image and the posture information of the first electronic device.
- the first electronic device acquires the first audio signal of the first microphone, the second audio signal of the second microphone, and the third audio signal of the third microphone.
- the first electronic device performs noise reduction processing on the first audio signal, the second audio signal and the third audio signal based on the relative positions.
- the above-mentioned first microphone, second microphone and third microphone may form a first microphone array.
- the first microphone may comprise one or more microphones in the first electronic device.
- the second microphone may comprise one or more microphones in the left ear wireless headset.
- the third microphone may comprise one or more microphones in the right ear wireless headset.
- the near-field area formed by the first microphone array includes the area where the user wearing the TWS headset and the first electronic device 100 are located.
- the size of the first microphone array is larger and the spatial resolution capability is stronger, which can more accurately distinguish the target sound in the near-field area and the ambient noise from the far-field area.
- the target sound can be better enhanced, the environmental noise can be suppressed, and the sound quality of the recorded video can be improved.
- adding a microphone of the first electronic device 100 to the microphone array can reduce the influence of the way the user wears the TWS headset on the ability of the first electronic device 100 to enhance the target sound and reduce environmental noise.
- the above-mentioned target sound may be a sound whose sound source is located in the near-field area of the first microphone array.
- the above-mentioned target sound may include the user's voice and the sound of the user playing a musical instrument.
- before the first electronic device performs noise reduction processing on the first audio signal, the second audio signal, and the third audio signal based on the above-mentioned relative positions, the first electronic device may perform time delay alignment on the first audio signal, the second audio signal, and the third audio signal.
- the first electronic device may emit an alignment sound.
- the alignment sound is obtained by digital-to-analog conversion of the alignment signal.
- the first electronic device may perform time delay correlation detection on the first alignment signal portion in the first audio signal, the second alignment signal portion in the second audio signal, and the third alignment signal portion in the third audio signal, and determine the delay lengths between the first audio signal, the second audio signal, and the third audio signal.
- the first electronic device may perform time delay alignment on the first audio signal, the second audio signal, and the third audio signal based on the delay lengths.
- the above-mentioned alignment signal may be an audio signal with a frequency higher than 20000 Hz.
- the audible frequency range of the human ear is between 20Hz and 20000Hz.
- because the frequency of the alignment signal is higher than 20000 Hz, the alignment sound will not be heard by the user. This prevents the alignment sound from disturbing the user.
- the above-mentioned time delay alignment can ensure that the aligned first audio signal, second audio signal and third audio signal correspond to data collected at the same time, so as to reduce the influence of the time delay error on the noise reduction processing performed by the first electronic device.
- the relative positions of the first microphone, the second microphone, and the third microphone may include coordinates of the first microphone, the second microphone, and the third microphone in the world coordinate system.
- the first electronic device may, based on the correspondence between the coordinates of the first face key points in the standard human head coordinate system and their coordinates in the face image coordinate system, determine the first conversion relationship between the standard human head coordinate system and the first electronic device coordinate system.
- the standard human head coordinate system can be determined according to the standard human head model.
- the first electronic device may store the coordinates of each key point in the standard human head model in the standard human head coordinate system.
- the first electronic device may determine the coordinates of the left and right ears in the standard human head model in the first electronic device coordinate system based on the first conversion relationship and the coordinates of the left and right ears in the standard human head model in the standard human head coordinate system.
- the coordinates of the left ear and the right ear in the coordinate system of the first electronic device in the standard human head model are the coordinates of the second microphone and the third microphone in the coordinate system of the first electronic device, respectively.
- the first electronic device may determine the second conversion relationship between the coordinate system of the first electronic device and the world coordinate system according to the attitude information of the first electronic device.
- the first electronic device may determine, based on the second conversion relationship and the coordinates of the first microphone, the second microphone, and the third microphone in the coordinate system of the first electronic device, the coordinates of the first microphone, the second microphone, and the third microphone in the world coordinate system (i.e., the above-mentioned relative positions).
- the pose of the first electronic device is unknown and time-varying.
- the target sound to be collected by the first microphone array not only includes the user's voice, but also includes the user's sound of playing a musical instrument.
- using the coordinates of each microphone in the first microphone array in the world coordinate system during spatial filtering by the first electronic device can better enhance the target sound and reduce ambient noise.
- the noise reduction processing performed by the first electronic device on the first audio signal, the second audio signal and the third audio signal based on the relative positions may specifically include: the first electronic device may perform voice activity detection on the first audio signal, the second audio signal and the third audio signal based on the relative positions.
- the voice activity detection may be used to determine the frequency points of the target sound signal and the frequency points of the ambient noise signal in the first audio signal, the second audio signal and the third audio signal.
- the first electronic device may update the noise space characteristic of the ambient noise based on the frequency point of the target sound signal and the frequency point of the ambient noise signal.
- Noise spatial characteristics can be used to indicate the spatial distribution of ambient noise.
- the spatial distribution of ambient noise includes the direction and energy of ambient noise.
- the first electronic device may determine target steering vectors for the first audio signal, the second audio signal, and the third audio signal based on the relative positions.
- the target steering vector can be used to indicate the direction of the target sound signal.
- the first electronic device determines a spatial filter based on the noise spatial characteristic and the target steering vector, and uses the spatial filter to perform spatial filtering on the first audio signal, the second audio signal, and the third audio signal, so as to enhance the target sound signal in the first audio signal, the second audio signal, and the third audio signal and suppress the ambient noise signal therein.
- the first electronic device may mix the first video captured during the first time period with the fourth audio signal.
- the fourth audio signal may be the audio signal after noise reduction processing of the first audio signal, the second audio signal and the third audio signal.
- the first audio signal, the second audio signal, and the third audio signal may all be audio signals collected in the first time period.
- the present application provides an electronic device, which may include a communication device, a camera, a microphone, a memory, and a processor.
- the communication device can be used to establish a communication connection with the wireless headset.
- a camera can be used to capture images.
- Microphones can be used for sound collection.
- the memory can be used to store standard human head coordinate systems as well as computer programs.
- the processor may be configured to invoke a computer program to cause the electronic device to execute any of the possible implementations of the second aspect above.
- the present application provides a computer storage medium comprising instructions. When the instructions are executed on an electronic device, the electronic device is made to execute any one of the possible implementations of the first aspect above, or any one of the possible implementations of the second aspect above.
- an embodiment of the present application provides a chip applied to an electronic device. The chip includes one or more processors, and the processor is configured to invoke computer instructions to cause the electronic device to execute any one of the possible implementations of the first aspect above, or any one of the possible implementations of the second aspect above.
- an embodiment of the present application provides a computer program product containing instructions. When the computer program product is run on an electronic device, the electronic device is made to execute any one of the possible implementations of the first aspect above, or any one of the possible implementations of the second aspect above.
- the electronic device provided in the third aspect, the computer storage medium provided in the fourth aspect, the chip provided in the fifth aspect, and the computer program product provided in the sixth aspect are all used to execute the method provided by the embodiments of the present application. Therefore, for the beneficial effects that can be achieved, reference may be made to the beneficial effects in the corresponding method, which will not be repeated here.
- FIG. 1A is a schematic diagram of a sound collection scene provided by an embodiment of the present application.
- FIG. 1B is a schematic diagram of the spatial resolution capability of a microphone array provided by an embodiment of the present application.
- FIG. 2A is a schematic diagram of another sound collection scene provided by an embodiment of the present application.
- FIG. 2B is a schematic diagram of the spatial resolution capability of another microphone array provided by an embodiment of the present application.
- FIGS. 3A to 3G are schematic diagrams of some sound collection scenarios provided by embodiments of the present application.
- FIG. 4 is a schematic structural diagram of a sound collection system provided by an embodiment of the present application.
- FIGS. 5A and 5B are schematic diagrams of a method for performing time delay alignment on an audio signal provided by an embodiment of the present application.
- FIGS. 6A to 6C are schematic diagrams of some coordinate transformations provided by embodiments of the present application.
- FIG. 7A is a flowchart of a method for enabling a first microphone array provided by an embodiment of the present application.
- FIG. 7B is a flowchart of a method for sound collection provided by an embodiment of the present application.
- FIG. 8 is a schematic structural diagram of a first electronic device 100 provided by an embodiment of the present application.
- the terms “first” and “second” are used for descriptive purposes only, and should not be construed as indicating or implying relative importance or as implicitly indicating the number of indicated technical features. Therefore, a feature defined as “first” or “second” may explicitly or implicitly include one or more such features. In the description of the embodiments of the present application, unless otherwise specified, “multiple” means two or more.
- the first electronic device (eg, mobile phone, tablet computer, etc.) establishes a communication connection with the TWS headset.
- the first electronic device may collect sound through the microphone of the TWS headset.
- the TWS earphone may utilize the microphone array of one earphone of the left ear TWS earphone 201 or the right ear TWS earphone 202 to collect sound.
- the sound collected by TWS headphones often includes ambient noise.
- the aforementioned ambient noise mainly comes from the far-field region of the microphone array in a single TWS headset.
- the environmental noise can be interference noise from multiple directions such as front interference noise, rear interference noise, left interference noise, and right interference noise of the TWS headset.
- the first electronic device can perform spatial filtering on the audio signal collected by the microphone array in a single TWS earphone to suppress ambient noise and enhance the target sound in the near-field area. But TWS earphones are limited in size. The near-field area of the microphone array in a single TWS headset is very small. The location where the user's mouth utters is already in the far-field region of the microphone array in the single TWS headset described above. Then, it is difficult for the first electronic device to distinguish between the user's voice and ambient noise in the same direction in the audio signal collected by the microphone array in the single TWS earphone.
- FIG. 1B illustrates the spatial resolution capability of the microphone array of a single TWS headset. Because the microphones in the microphone array of a single TWS headset are close together, the array is small and its spatial resolution capability is weak.
- the near-field target source of the microphone array of a single TWS headset often fails to include the sound source from the user's speech.
- the orientation of the microphone array of a single TWS headset is easily affected by the user's wearing posture, which causes the direction of the main lobe of the beam to deviate from the direction of the user's mouth during spatial filtering.
- the pickup capability of the microphone array of a single TWS headset is further weakened. Moreover, the microphone array of a single TWS headset is limited by size constraints, and the spatial spectrum peaks of the determined sound source are relatively broad. The target source signal is very easily obscured by ambient noise signals. Microphone arrays in a single TWS headset have difficulty accurately distinguishing between target sources and ambient noise. When performing spatial filtering on the audio signal collected by the microphone array in a single TWS earphone, the effect of the first electronic device in suppressing environmental noise and enhancing the user's voice is poor. The environment noise in the video recorded by the first electronic device is relatively large, and the user's voice is not clear enough.
- the present application provides a sound collection method.
- when the TWS headset has established a communication connection with the first electronic device 100 (such as a mobile phone or a tablet) and the first electronic device 100 records a video, this method can enhance the target sound, suppress the environmental noise, and improve the sound quality of the recorded video.
- the microphone of the first electronic device 100 and the microphone of the binaural TWS earphone may form a first microphone array.
- the above-mentioned target sound is the sound of the sound source in the near-field region of the first microphone array.
- the above-mentioned target sound may be, for example, the user's voice or the sound of the user playing a musical instrument.
- the size of the first microphone array is larger.
- the near-field area of the first microphone array may include an area formed by the user and the position of the first electronic device 100 .
- the above-mentioned ambient noise is mainly the sound of the sound source in the far-field region of the first microphone array.
- FIG. 2B exemplarily shows the spatial resolution capability of the first microphone array.
- the near-field area formed by the first microphone array combined with the microphones of the first electronic device 100 is larger than the near-field area of the microphone array of a single TWS headset, and is less affected by the wearing posture of the user.
- the spatial resolution capability of the first microphone array is stronger.
- the first microphone array can more accurately distinguish the direction and distance between each sound source and the first microphone array during the video recording process of the first electronic device.
- the spatial energy spectrum of ambient noise from any direction in the far-field region will be significantly attenuated when it reaches the near-field region, and the first electronic device 100 can better distinguish the ambient noise from the target sound.
- the first electronic device may perform spatial filtering on the audio signal collected by the first microphone array, and suppress the environmental noise and enhance the target sound according to the direction and distance of the target sound and the environmental noise. This reduces the influence of ambient noise on the video recorded by the first electronic device, and improves the sound quality in the recorded video.
- the schematic diagram of the spatial resolution capability shown in FIG. 2B is only used to explain the present application, and should not constitute a limitation on the spatial resolution capability of the first microphone array.
- the number of microphones of the first electronic device 100 constituting the above-mentioned first microphone array may be one or more. There may also be one or more microphones of the left ear and right ear TWS earphones forming the above-mentioned first microphone array. That is to say, the first microphone array includes at least three microphones. The three microphones come from the first electronic device 100 , the left TWS earphone 201 , and the right TWS earphone 202 respectively.
- when the headset has established a communication connection with the first electronic device 100 and the first electronic device 100 records a video, the microphone of the first electronic device 100 and the microphone of the headset may form the first microphone array to collect sound.
- the above-mentioned first electronic device 100 may be an electronic device with a camera, such as a mobile phone, a tablet computer, a notebook computer, a television, an ultra-mobile personal computer (UMPC), a handheld computer, a netbook, or a personal digital assistant (PDA).
- This embodiment of the present application does not limit the specific type of the first electronic device 100 .
- a microphone array consisting of a microphone in the first electronic device 100, a microphone in the left ear TWS earphone 201, and a microphone in the right ear TWS earphone 202 is used as an example to introduce the sound collection method provided by this application.
- the near-field area and the far-field area of the microphone array mentioned in the embodiments of the present application are described below.
- the sound field model can be divided into two models: the near-field model and the far-field model.
- the sound field model in the near-field region of the microphone array is a near-field model. In the near-field model, the microphone array response is related to both the direction and the distance of the incident sound source. Sound waves are spherical waves.
- the sound field model in the far-field region of the microphone array is a far-field model. In the far-field model, the microphone array response is only related to the direction of the incident sound source, not the distance of the incident sound source. Sound waves are plane waves.
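The difference between the two models can be illustrated with the arrival-time difference between two microphones. The following is a minimal sketch, not the method of this application: the speed of sound (343 m/s), the microphone positions, and the function names `near_field_delay` and `far_field_delay` are all assumptions chosen for illustration. The near-field function treats the wave as spherical, so the result depends on the source position; the far-field function treats it as planar, so the result depends only on the direction.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed value for air at room temperature


def near_field_delay(source, mic_a, mic_b):
    """Arrival-time difference (s) between two microphones for a spherical wave.

    Depends on the full source position, that is, on direction and distance.
    """
    return (math.dist(source, mic_a) - math.dist(source, mic_b)) / SPEED_OF_SOUND


def far_field_delay(direction, mic_a, mic_b):
    """Arrival-time difference (s) for a plane wave.

    `direction` is a unit vector pointing from the array toward the source;
    the source distance does not appear at all.
    """
    projection = sum((b - a) * u for a, b, u in zip(mic_a, mic_b, direction))
    return projection / SPEED_OF_SOUND


# Hypothetical geometry: two microphones 15 cm apart.
mic_a = (0.0, 0.0, 0.0)
mic_b = (0.15, 0.0, 0.0)
direction = (0.6, 0.8, 0.0)          # unit vector toward the source
close_source = (0.18, 0.24, 0.0)     # 0.3 m away along `direction`
distant_source = (18.0, 24.0, 0.0)   # 30 m away along `direction`

plane_wave = far_field_delay(direction, mic_a, mic_b)
spherical_close = near_field_delay(close_source, mic_a, mic_b)
spherical_distant = near_field_delay(distant_source, mic_a, mic_b)
```

For the distant source the spherical-wave delay is nearly identical to the plane-wave delay, while for the close source the two differ noticeably, which is why distance information is only usable in the near-field region.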
- the size of the near-field area of the microphone array is positively related to the size of the microphone array. There is no absolute standard for dividing the near-field area and the far-field area. In some embodiments, the region whose distance from the central reference point of the microphone array is greater than the wavelength of the sound signal is the far-field region of the microphone array. Conversely, it is the near-field area of the microphone array. The embodiments of the present application do not limit the manner of dividing the near-field area and the far-field area of the microphone array.
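As a worked example of the wavelength-based division mentioned above, the sketch below classifies a source by comparing its distance from the array reference point with the wavelength of the sound. The speed of sound (343 m/s) and the function names are assumptions, and this is only one of the possible division rules.

```python
SPEED_OF_SOUND = 343.0  # m/s; assumed value for air at room temperature


def wavelength(frequency_hz):
    """Wavelength in metres of a sound wave at the given frequency."""
    return SPEED_OF_SOUND / frequency_hz


def field_region(distance_m, frequency_hz):
    """Return "far" if the distance exceeds one wavelength, otherwise "near"."""
    return "far" if distance_m > wavelength(frequency_hz) else "near"


# At 1000 Hz the wavelength is 0.343 m.
region_close = field_region(0.3, 1000.0)   # "near"
region_away = field_region(2.0, 1000.0)    # "far"
# At 200 Hz the wavelength is 1.715 m, so a source 1 m away is still "near".
region_low = field_region(1.0, 200.0)      # "near"
```

Under this rule the boundary moves with frequency, so the same source position can fall in different regions for different frequency components.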
- the size of the first microphone array is relatively large, which can better include the position of the sound source of the sound to be enhanced (eg, the user's voice) in the near-field area of the first microphone array.
- the sound source of the target sound and the sound source of the ambient noise can be better distinguished by the difference in direction and distance.
- the sound source of the target sound may be a sound source located in the near-field region of the first microphone array.
- the sound source of the aforementioned ambient noise may be a sound source located in the far-field region of the first microphone array.
- the sound collection method of the present application is applicable to the scenario where the first electronic device 100 is connected to the TWS headset, and the front camera is turned on for live broadcast, shooting of vlogs, and other video recording scenarios. Then, the above-mentioned target sound may include the user's voice and the sound of the user playing a musical instrument.
- the first electronic device 100 may detect a user operation to turn on the sound source enhancement function.
- the first electronic device 100 can use its own microphone, the microphone of the left TWS earphone 201 and the microphone of the right TWS earphone 202 to form a first microphone array to collect sound.
- the first electronic device 100 can enhance the target sound from the user according to the audio collected by the first microphone array, and suppress the ambient noise therein, thereby improving the sound quality of the recorded video.
- FIGS. 3A to 3G exemplarily show schematic diagrams of a sound collection scenario provided by an embodiment of the present application.
- both the left ear TWS earphone 201 and the right ear TWS earphone 202 have established a Bluetooth connection with the first electronic device 100 .
- the first electronic device 100 may send the audio to be played to the TWS headset through Bluetooth.
- the audio output device of the TWS headset such as a speaker, can play the received audio.
- the TWS headset can send the audio collected through the audio input device (eg, microphone) to the first electronic device 100 through Bluetooth.
- the first electronic device 100 may be configured with a camera 193 .
- the camera 193 may include a front camera and a rear camera.
- the front camera of the first electronic device 100 may be as shown in FIG. 3A .
- This embodiment of the present application does not limit the number of cameras 193 of the first electronic device 100 .
- the first electronic device 100 may display the user interface 310 as shown in FIG. 3A .
- User interface 310 may include a camera application icon 311.
- the first electronic device 100 may open the camera application.
- the first electronic device 100 may turn on the front camera and display the user interface 320 as shown in FIG. 3B .
- the first electronic device 100 may also turn on the rear camera, or turn on the front camera and the rear camera. This embodiment of the present application does not limit this.
- the user interface 310 may further include more content, which is not limited in this embodiment of the present application.
- the user interface 320 may include a flash control 321, a setting control 322, a preview frame 323, a camera mode option 324, a gallery shortcut key 325, a recording start control 326, and a camera flip control 327. Among them:
- Flash control 321 can be used to turn the flash on or off.
- the setting control 322 can be used to adjust the parameters of video recording (such as resolution, etc.) and turn on or off some methods for video recording (such as silent shooting, etc.).
- the preview frame 323 can be used to display the images captured by the camera 193 in real time.
- the first electronic device 100 may refresh the displayed content in real time, so that the user can preview the image currently captured by the camera 193 .
- One or more shooting mode options may be displayed in camera mode options 324 .
- the one or more shooting mode options may include: portrait mode options, photo mode options, video mode options, professional mode options, more options.
- the one or more shooting mode options can be represented as text information on the interface, such as "portrait”, “photograph”, “video”, “professional”, “more”.
- the one or more camera options may also be represented as icons or other forms of interactive elements (IEs) on the interface.
- the first electronic device 100 can start the shooting mode selected by the user.
- the first electronic device 100 can further display more other shooting mode options, such as slow-motion shooting mode options, etc., and can show the user richer camera functions.
- more options may not be displayed in the camera mode option, and the user may browse other shooting mode options by swiping left/right in the camera mode option 324 .
- Gallery shortcut 325 can be used to open the Gallery application.
- the first electronic device 100 may open the gallery application.
- the gallery application is a picture management application on an electronic device such as a smart phone and a tablet computer, and may also be called an "album", and the name of the application is not limited in this embodiment.
- the gallery application can support the user to perform various operations on the pictures stored on the first electronic device 100, such as browsing, editing, deleting, selecting and other operations.
- the recording start control 326 can be used to monitor the user operation that triggers the start of recording. Exemplarily, as shown in FIG. 3B , the recording mode option is selected.
- the first electronic device 100 may start recording.
- the first electronic device 100 may save the images captured by the camera 193 in the time period from the start of recording to the end of recording as videos in the order of collection time.
- the audio in the video may come from audio collected by an audio input device (such as a microphone) of the first electronic device 100 and/or audio collected by an audio input device (such as a microphone) of a TWS headset that has established a Bluetooth connection with the first electronic device 100.
- the camera flip control 327 can be used to monitor the user operation that triggers flipping of the camera.
- the first electronic device 100 may flip the camera. For example, switch the rear camera to the front camera.
- the image captured by the front camera as shown in FIG. 3B may be displayed in the preview frame 323 .
- the user interface 320 may further include more or less content, which is not limited in this embodiment of the present application.
- the first electronic device 100 may display the user interface 330 as shown in FIG. 3C.
- User interface 330 may include a return control 331, a resolution control 332, a geographic location control 333, a capture mute control 334, and a sound source enhancement control 335.
- the sound source enhancement control 335 can be used by the user to enable or disable the sound source enhancement function.
- the first electronic device 100 will turn on the microphone of the device when recording video.
- the sound source enhancement function is turned off.
- the first electronic device 100 may display a user interface 330 as shown in FIG. 3D .
- a prompt box 336 may be included in the user interface 330 .
- the prompt box 336 may include a text prompt "The microphone of this device will be turned on while recording the video".
- the prompt box 336 may be used to prompt the user that when the sound source enhancement function is enabled, the first electronic device 100 will turn on the microphone of the device during the video recording process.
- a determination control 336A may also be included in the prompt box 336 .
- the first electronic device 100 may display the user interface 330 as shown in FIG. 3E. At this time, the sound source enhancement function is turned on.
- the above-mentioned sound source enhancement function is invalid. That is, the user cannot use the above-mentioned sound source enhancement function to enhance the sound quality in the recorded video.
- the first electronic device 100 may display the user interface 320 as shown in FIG. 3F .
- the first electronic device 100 may start recording.
- both the left ear TWS earphone 201 and the right ear TWS earphone 202 are connected to the first electronic device 100 .
- the user wears the left ear TWS earphone 201 and the right ear TWS earphone 202 , and uses the front camera of the first electronic device 100 to perform video recording.
- the audio signal in the video comes from the first audio signal collected by the microphone of the first electronic device 100 , the second audio signal collected by the microphone of the left TWS earphone 201 , and the third audio signal collected by the microphone of the right TWS earphone 202 .
- the first electronic device 100 can process the first audio signal, the second audio signal, and the third audio signal to enhance the target sound and suppress the ambient noise, thereby improving the sound quality of the recorded video.
- when a connection is established between the TWS headset and the first electronic device 100, the first electronic device 100 can directly turn on its microphone to collect sound upon detecting a user operation to start recording. That is to say, when the user wears the TWS headset and uses the first electronic device 100 to record a video, the user need not manually turn on the above-mentioned sound source enhancement function.
- the first electronic device 100 can automatically implement the above-mentioned sound source enhancement function when the user wears the TWS headset and uses the first electronic device 100 to record a video, so as to improve the sound quality of the recorded video.
- since transmitting the second audio signal and the third audio signal to the first electronic device 100 takes a certain amount of time, there are time delays of different lengths among the first audio signal, the second audio signal, and the third audio signal.
- the first electronic device 100 may perform time delay alignment on the first audio signal, the second audio signal, and the third audio signal to obtain a joint microphone array signal.
- the first electronic device may perform spatial filtering on the joint microphone array signal.
- the above spatial filtering requires position information of the first microphone array composed of the microphone of the first electronic device 100, the microphone of the left ear TWS earphone 201, and the microphone of the right ear TWS earphone 202. Since the left ear TWS earphone 201 and the right ear TWS earphone 202 are worn on the user's left ear and right ear respectively, the first electronic device 100 can use the face image collected by the front camera and its own posture information to determine the position information of each microphone in the first microphone array.
- the implementation methods for the above-mentioned first electronic device to perform time delay alignment, determine the position information of each microphone in the first microphone array, and perform spatial filtering on the signal of the joint microphone array will be described in detail in subsequent embodiments.
- the embodiments of the present application do not limit the user operations mentioned above and in subsequent embodiments.
- the user can instruct the first electronic device 100 to execute the instruction corresponding to the control by touching the position of the control on the display screen (such as enabling the function of sound source enhancement, starting video recording, etc.).
- the following describes a first microphone array and a sound collection system for sound collection provided by an embodiment of the present application in conjunction with the sound collection scene in the foregoing embodiment.
- the sound collection system may include a first electronic device 100 , a left ear TWS earphone 201 , and a right ear TWS earphone 202 .
- the first microphone array may include a first microphone 211 , a second microphone 230 , and a third microphone 220 .
- the first microphone 211 is a microphone of the first electronic device 100 .
- the second microphone 230 is the microphone of the left ear TWS earphone 201 .
- the third microphone 220 is the microphone of the right ear TWS earphone 202 .
- the first electronic device 100 may further include a speaker 210, a camera 213, an attitude sensor 214, a digital-to-analog converter (DAC) 215A, an analog-to-digital converter (ADC) 215B, an ADC 215C, and a digital signal processor (DSP) 216.
- Speaker 210 may be used to convert audio electrical signals into sound signals.
- the first electronic device 100 may play sound through the speaker 210 .
- the camera 213 may be the front camera in the camera 193 in the foregoing embodiment. Camera 213 can be used to capture still images or video.
- the attitude sensor 214 may be used to measure attitude information of the first electronic device 100 .
- the attitude sensor 214 may include a gyro sensor, an acceleration sensor.
- the DAC215A can be used to convert digital audio signals to analog audio signals.
- the ADC215B can be used to convert analog audio signals into digital audio signals.
- ADC215C can be used to convert analog image signals to digital image signals.
- the DSP 216 can be used to process digital signals, for example, digital image signals and digital audio signals.
- the DSP 216 may include a signal generator 216A, a time delay alignment module 216B, a coordinate calculation module 216C, and a spatial filtering module 216D. Among them:
- Signal generator 216A may be used to generate alignment signals.
- the alignment signal is a digital audio signal. Since the frequency band audible to the human ear is generally 20 hertz (Hz) to 20000 Hz, the above-mentioned alignment signal may be an audio signal with a frequency above 20000 Hz. In this way, the alignment signal avoids the audible frequency band and does not disturb the user.
- the alignment signal can be used by the first electronic device 100 to perform time delay alignment on the audio signals collected by the microphones in the first microphone array. Wherein, when the first electronic device 100 detects a user operation to start recording, the signal generator 216A can generate an alignment signal, and send the alignment signal to the above-mentioned DAC 215A.
- the embodiment of the present application does not limit the frequency of the above-mentioned alignment signal, and the alignment signal may also be an audio signal with a frequency of 20000 Hz or a frequency range below 20000 Hz.
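A signal generator of this kind can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the signal generator 216A: the 21000 Hz tone frequency, the 48000 Hz sampling rate, the 50 ms duration, the amplitude, and the function name `make_alignment_signal` are all hypothetical. The only constraint taken from the text is that the example tone lies above 20000 Hz; the Nyquist check is added because a digital tone must stay below half the sampling rate.

```python
import math


def make_alignment_signal(freq_hz=21000.0, sample_rate=48000,
                          duration_s=0.05, amplitude=0.5):
    """Return a sine tone as a list of float samples in [-amplitude, amplitude]."""
    if freq_hz >= sample_rate / 2:
        raise ValueError("tone frequency must be below the Nyquist frequency")
    sample_count = round(sample_rate * duration_s)
    return [amplitude * math.sin(2 * math.pi * freq_hz * n / sample_rate)
            for n in range(sample_count)]


# 50 ms of a 21 kHz tone sampled at 48 kHz gives 2400 samples.
alignment_signal = make_alignment_signal()
```

The resulting samples would then be handed to a digital-to-analog converter and a speaker, as described for the DAC 215A and the speaker 210.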
- the DAC 215A can convert the alignment signal to an analog signal and send it to the speaker 210 . Then, the speaker 210 may emit an alignment sound.
- the alignment sound is the sound corresponding to the above alignment signal.
- the first microphone 211, the second microphone 230, and the third microphone 220 can all hear the above-mentioned alignment sound.
- the signal generator 216A may also send the alignment signal to the time delay alignment module 216B.
- the time delay alignment module 216B can use the above-mentioned alignment signal to perform time delay alignment on the first audio signal collected by the first microphone 211, the second audio signal collected by the second microphone 230, and the third audio signal collected by the third microphone 220 to obtain a joint microphone array signal.
- the above joint microphone array signal is a matrix. The first row, the second row and the third row of the matrix may be the first audio signal, the second audio signal and the third audio signal that have undergone time delay alignment processing, respectively.
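One way to picture the time delay alignment is to locate the known alignment signal in each channel by cross-correlation, discard the samples before it, and stack the trimmed channels as rows of a matrix. The sketch below is an assumption-laden illustration rather than the method of the time delay alignment module 216B: a Barker-13 sequence stands in for the alignment signal purely because its correlation peak is sharp, and the names `find_delay`, `align_channels`, and `simulate` as well as the sample data are hypothetical.

```python
def find_delay(channel, reference):
    """Return the sample offset at which `reference` best matches `channel`."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(len(channel) - len(reference) + 1):
        score = sum(r * channel[lag + i] for i, r in enumerate(reference))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def align_channels(channels, reference):
    """Remove each channel's delay and trim all channels to a common length.

    Returns the joint signal as a list of rows (one row per microphone)
    together with the delay, in samples, that was found for each channel.
    """
    delays = [find_delay(channel, reference) for channel in channels]
    trimmed = [channel[d:] for channel, d in zip(channels, delays)]
    length = min(len(row) for row in trimmed)
    return [row[:length] for row in trimmed], delays


# Hypothetical stand-in for the alignment signal (Barker-13 sequence).
BARKER_13 = [1, 1, 1, 1, 1, -1, -1, 1, 1, -1, 1, -1, 1]
payload = [0.2, -0.1, 0.3, 0.0, -0.2]  # hypothetical sound after the reference


def simulate(delay):
    """Build one channel: silence, then the reference, then the payload."""
    return [0.0] * delay + BARKER_13 + payload + [0.0] * 4


first, second, third = simulate(2), simulate(9), simulate(6)
joint, delays = align_channels([first, second, third], BARKER_13)
```

After alignment, sample n of every row corresponds to the same instant, so each column of `joint` holds simultaneous samples from the three microphones.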
- the coordinate calculation module 216C can be used to determine the coordinates of each microphone in the first microphone array in the world coordinate system.
- Each microphone in the first microphone array is located in a different electronic device.
- the positions of the microphones in the first microphone array are different according to the distance and posture of the user wearing the TWS headset relative to the first electronic device 100 .
- the coordinate calculation module 216C can calculate the coordinates of each microphone in the first microphone array in the world coordinate system by using the face image collected by the front camera (ie the camera 213 ) and the attitude information of the first electronic device 100 .
- the camera 213 can send the collected image to the ADC 215C.
- the ADC 215C can convert the image into a digital signal to obtain the image signal shown in FIG. 4.
- the ADC 215C may send the image signal to the coordinate calculation module 216C.
- the attitude sensor 214 may send the collected attitude information of the first electronic device 100 to the coordinate calculation module 216C.
- the coordinate calculation module 216C can obtain the coordinate information of the first microphone array.
- the coordinate information of the first microphone array includes the coordinates of each microphone in the first microphone array in the world coordinate system.
- the spatial filtering module 216D may be used to spatially filter the aforementioned joint microphone array signal. Based on the coordinate information of the above-mentioned first microphone array, the spatial filtering module 216D can enhance the target sound from the user in the above-mentioned joint microphone array signal, and suppress environmental noise. The spatial filtering module 216D may output the resulting audio signal. The resulting audio signal is the spatially filtered joint microphone array signal.
- the spatial filtering module 216D may perform spatial filtering on the time-delay-aligned first audio signal, second audio signal, and third audio signal together. Specifically, the spatial filtering module 216D may use the multiple rows of data in the same column of the joint microphone array signal as a group of data, and perform spatial filtering on each group of data. The spatial filtering module 216D can then output the spatially filtered signal, which is the resulting audio signal described above.
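The column-by-column processing can be sketched with a fixed weighted sum across microphones. This is a deliberately simplified assumption, not the spatial filter of this application: a practical spatial filter would derive its weights from the microphone coordinates and typically work per frequency band, whereas here the equal weights, the function name `filter_columns`, and the sample data are hypothetical. The sketch only shows how each column (one sample per microphone) is reduced to one output sample.

```python
def filter_columns(joint, weights):
    """Combine each column of the joint signal into one output sample.

    `joint` is a list of rows (one per microphone) of equal length, and
    `weights` holds one weight per microphone.
    """
    if len(joint) != len(weights):
        raise ValueError("one weight per microphone row is required")
    return [sum(w * x for w, x in zip(weights, column))
            for column in zip(*joint)]


# Hypothetical aligned data: the same target [1, 2, 3, 4] in every row,
# plus noise that differs from microphone to microphone.
joint = [
    [1.3, 1.7, 3.3, 3.7],   # target + noise
    [0.7, 2.3, 2.7, 4.3],   # target + opposite noise
    [1.0, 2.0, 3.0, 4.0],   # target only
]
weights = [1 / 3, 1 / 3, 1 / 3]
result = filter_columns(joint, weights)
```

In this toy data the target adds coherently across rows while the noise cancels, so `result` is approximately the target sequence.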
- alternatively, the spatial filtering module 216D may perform spatial filtering separately on the first audio signal, the second audio signal and the third audio signal that have undergone time delay alignment processing. Specifically, the spatial filtering module 216D may perform spatial filtering on the first row of the joint microphone array signal (i.e., the first audio signal subjected to time delay alignment processing) to obtain a first audio channel, on the second row (i.e., the second audio signal subjected to time delay alignment processing) to obtain a second audio channel, and on the third row (i.e., the third audio signal subjected to time delay alignment processing) to obtain a third audio channel. The first electronic device 100 may combine the first audio channel, the second audio channel, and the third audio channel into one audio channel to obtain the resulting audio signal.
- the first electronic device 100 may mix the above-mentioned result audio signal with the video collected by the camera and save it locally or upload it to a cloud server.
- the left-ear TWS earphone 201 may further include ADC231A.
- ADC231A can be used to convert analog audio signals to digital audio signals.
- the second microphone 230 may transmit the heard sound (ie, an analog audio signal) to the ADC 231A.
- the ADC231A can convert the sound into a second audio signal (ie, a digital audio signal).
- the sound heard by the second microphone 230 may include alignment sound, target sound from the user (eg, the user's voice, the sound of the user playing a musical instrument), ambient noise, and the like.
- the second audio signal may include the above-mentioned alignment signal, the target sound signal from the user, and the ambient noise signal.
- the left TWS earphone 201 can send the above-mentioned second audio signal to the first electronic device 100 through Bluetooth.
- the ADC 221A may also be included in the right ear TWS earphone 202 .
- ADC221A can be used to convert analog audio signals to digital audio signals.
- the third microphone 220 may transmit the heard sound (ie, an analog audio signal) to the ADC 221A.
- the ADC221A can convert the sound into a third audio signal (ie, a digital audio signal).
- the sound heard by the third microphone 220 may include alignment sound, target sound from the user (eg, the user's voice, the sound of the user playing a musical instrument), ambient noise, and the like.
- the third audio signal may include the above-mentioned alignment signal, the target sound signal from the user, and the ambient noise signal.
- the right-ear TWS earphone 202 can send the above-mentioned third audio signal to the first electronic device 100 through Bluetooth.
- the near-field area formed by the first microphone array includes the user wearing the TWS headset and the area where the first electronic device 100 is located.
- the size of the first microphone array is larger and its spatial resolution capability is stronger, so it can more accurately distinguish the target sound from the user in the near-field area from the ambient noise in the far-field area.
- the target sound can be better enhanced, the environmental noise can be suppressed, and the sound quality of the recorded video can be improved.
- adding a microphone of the first electronic device 100 to the microphone array can reduce the influence that the posture in which the user wears the TWS headset has on how well the first electronic device 100 enhances the target sound and reduces environmental noise.
- the following specifically introduces a method for time delay alignment of an audio signal provided by an embodiment of the present application.
- the microphones in the first microphone array are distributed on different electronic devices.
- the first electronic device 100 may process the audio signal collected by the first microphone array, and mix the processed audio signal with the video collected by the camera.
- the audio signal collected by the microphone in the TWS headset needs to be sent to the first electronic device 100 through Bluetooth.
- the above signal transmission can introduce delay errors on the order of hundreds of milliseconds.
- the time at which each microphone in the first microphone array starts to collect sound may be inconsistent, and the time at which the first electronic device 100 and the TWS headset process the audio signal collected by the respective microphones may also be inconsistent.
- the above factors may cause a time delay error among the first audio signal, the second audio signal and the third audio signal received by the first electronic device 100.
- the first electronic device 100 may generate an alignment signal, convert the alignment signal into an analog audio sound effect, and then emit an alignment sound from a speaker.
- the first electronic device 100 may use the above-mentioned alignment signal to determine the delay between the audio signals collected by the microphones.
- the first electronic device 100 may emit an alignment sound.
- the first microphone, the second microphone and the third microphone can collect the first audio signal, the second audio signal and the third audio signal respectively.
- the first audio signal, the second audio signal, and the third audio signal may each include an alignment signal, a target sound signal from a user, and an ambient noise signal.
- the delay alignment module 216B of the first electronic device 100 may receive the first audio signal 502, the second audio signal 503, the third audio signal 504, and the alignment signal 501 from the signal generator 216A. The data in the alignment signal 501 corresponding to the alignment sound emitted at time T0, the data collected by the first microphone at time T0 in the first audio signal 502, the data collected by the second microphone at time T0 in the second audio signal 503, and the data collected by the third microphone at time T0 in the third audio signal 504 are not aligned with one another.
- the delay alignment module 216B can time-delay align the first audio signal 502, the second audio signal 503 and the third audio signal 504 with the alignment signal 501.
- the delay alignment module 216B may perform high-pass filtering on the first audio signal 502, the second audio signal 503, and the third audio signal 504 to obtain the alignment signal contained in these audio signals.
- the delay alignment module 216B may perform delay correlation detection between the alignment signals in the audio signals and the alignment signal 501 to obtain the delay between the alignment signals.
- the delay between these alignment signals is the delay of the first audio signal 502, the second audio signal 503 and the third audio signal 504 relative to the alignment signal 501.
- the delay alignment module 216B can delay the alignment signals in the first audio signal 502, the second audio signal 503, and the third audio signal 504 by different lengths of time. Then, the delay alignment module 216B may compare the correlation between each delayed alignment signal and the alignment signal 501.
- the delay alignment module 216B may perform high-pass filtering on the first audio signal 502 to obtain a signal 512 including the alignment signal.
- the delay alignment module 216B may compare the correlation between the alignment signal 501 and the signal 512 starting at different time instants.
- the alignment signal 501 may be, for example, an audio signal with a time length of 1 second.
- the delay alignment module 216B can determine that the 1-second portion of the signal 512 starting at time Δt1 has the highest correlation with the alignment signal 501. The delay alignment module 216B can then determine that the delay between the first audio signal 502 and the alignment signal 501 is Δt1.
- the delay alignment module 216B may perform high-pass filtering on the second audio signal 503 to obtain a signal 513 including the alignment signal.
- the delay alignment module 216B may compare the correlation between the alignment signal 501 and the signal 513 starting at different times.
- the alignment signal 501 may be, for example, an audio signal with a time length of 1 second.
- the delay alignment module 216B can determine that the 1-second portion of the signal 513 starting at time Δt2 has the highest correlation with the alignment signal 501. The delay alignment module 216B can then determine that the delay between the second audio signal 503 and the alignment signal 501 is Δt2.
- the delay alignment module 216B may perform high-pass filtering on the third audio signal 504 to obtain a signal 514 including the alignment signal.
- the delay alignment module 216B may compare the correlation between the alignment signal 501 and the signal 514 starting at different times.
- the alignment signal 501 may be, for example, an audio signal with a time length of 1 second.
- the delay alignment module 216B can determine that the 1-second portion of the signal 514 starting at time Δt3 has the highest correlation with the alignment signal 501. The delay alignment module 216B can then determine that the delay between the third audio signal 504 and the alignment signal 501 is Δt3.
- the above-mentioned alignment signal 501 may be all the alignment signals generated by the signal generator 216A.
- the above-mentioned alignment signal 501 may be a part of all the alignment signals generated by the signal generator 216A.
- the alignment signal generated by the signal generator 216A is a periodic signal.
- the signal generator 216A may generate multiple cycles of the alignment signal.
- the delay alignment module 216B can perform delay correlation detection on the signal 512, the signal 513 and the signal 514 by using one period of the alignment signal, to determine the delay among the first audio signal, the second audio signal and the third audio signal.
- the above-mentioned time delay correlation detection may be a correlation detection method such as a cross-correlation time estimation method.
- the embodiments of the present application do not limit the method for the delay correlation detection performed by the delay alignment module 216B.
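- As an illustration of the delay correlation detection described above, the following is a minimal Python sketch of a cross-correlation delay estimate. It assumes that the received audio has already been high-pass filtered and that the alignment signal occurs only once in the searched window. The function name `estimate_delay` and its parameters are hypothetical and are not taken from the present application.

```python
import numpy as np


def estimate_delay(alignment: np.ndarray, received: np.ndarray, sample_rate: int) -> float:
    """Return the delay, in seconds, at which `alignment` best matches `received`.

    `alignment` is the known alignment signal; `received` is a (longer)
    high-pass-filtered microphone signal that contains it somewhere.
    """
    if len(received) < len(alignment):
        raise ValueError("received signal is shorter than the alignment signal")
    # mode="valid" evaluates only full overlaps, so index k of `corr` is the
    # correlation for the alignment signal starting at sample k of `received`.
    corr = np.correlate(received, alignment, mode="valid")
    best_start = int(np.argmax(corr))
    return best_start / sample_rate
```

- In this sketch, the returned value plays the role of the delay of one audio signal relative to the alignment signal 501; it would be computed once per microphone signal.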
- the delay alignment module 216B can subtract the delay corresponding to each audio signal from the time at which that audio signal was received, to determine which data in the audio signals were collected at the same moment by the microphones of the first microphone array.
- the delay alignment module 216B may output the joint microphone array signal.
- the joint microphone array signal may contain multiple audio signals.
- the plurality of audio signals are audio signals obtained by filtering out the alignment signals of the first audio signal 502 , the second audio signal 503 , and the third audio signal 504 .
- the data corresponding to the same time is the data collected by the microphones in the first microphone array at the same time. In this way, the delay alignment module 216B can eliminate the delay error caused by signal transmission.
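- The alignment step can be sketched as follows, assuming that the delay of each audio signal has already been expressed as a whole number of samples. The helper `align_channels` is hypothetical; it simply drops the leading samples of each channel so that equal indices correspond to the same capture time.

```python
from typing import Sequence

import numpy as np


def align_channels(channels: Sequence[np.ndarray], delays: Sequence[int]) -> np.ndarray:
    """Remove each channel's delay and stack the channels into one array.

    channels[i] is a 1-D audio signal and delays[i] is its delay in samples.
    The result has shape (num_channels, num_samples).
    """
    if len(channels) != len(delays):
        raise ValueError("need exactly one delay per channel")
    # Dropping the first delays[i] samples moves the common reference
    # instant to index 0 in every channel.
    shifted = [ch[d:] for ch, d in zip(channels, delays)]
    # Truncate to the shortest channel so the rows have equal length.
    n = min(len(s) for s in shifted)
    return np.stack([s[:n] for s in shifted])
```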
- FIG. 5A and FIG. 5B are only for explaining the present application and should not be construed as a limitation.
- the delay alignment module 216B may also use any one of the first audio signal 502, the second audio signal 503 and the third audio signal 504 as a reference audio signal, and time-delay align the other audio signals with the reference audio signal.
- the delay alignment module 216B may use the first audio signal 502 as the reference audio signal.
- the time delay alignment module 216B may perform high-pass filtering on the first audio signal 502 , the second audio signal 503 and the third audio signal 504 to obtain an alignment signal portion in these audio signals.
- the delay alignment module 216B may delay the alignment signal portions in the second audio signal 503 and the third audio signal 504 by different lengths of time, to determine at which delay the alignment signal portions in the second audio signal 503 and the third audio signal 504 have the highest correlation with the alignment signal portion in the first audio signal 502. In this way, the delay alignment module 216B can determine the delay of the second audio signal 503 and the third audio signal 504 relative to the first audio signal 502.
- the first electronic device 100 may generate the above-mentioned alignment signal during a preset time period for starting recording (such as a preset time period before starting recording or a preset time period after starting recording), and emit an alignment sound.
- the first electronic device 100 may determine the delay among the first audio signal, the second audio signal, and the third audio signal according to the alignment signal of the preset time period.
- the above-mentioned delay generally does not change, or changes very little, during the recording process of the first electronic device 100.
- the first electronic device 100 may perform time delay alignment on the first audio signal, the second audio signal and the third audio signal according to the above delay.
- the first electronic device 100 may stop generating the above-mentioned alignment signal, so as to save the power consumption of the first electronic device 100 .
- a method for determining coordinate information of a first microphone array provided by an embodiment of the present application is specifically described below.
- the first electronic device 100 may perform spatial filtering on the above joint microphone array signal to enhance the target sound from the user and suppress ambient noise. In the above process of spatial filtering, the first electronic device 100 needs to determine the direction of the target sound signal and the direction of the ambient noise by determining the coordinate information of the first microphone array.
- the microphones in the first microphone array are respectively distributed in the TWS earphone and the first electronic device 100 .
- the above-mentioned TWS earphones are usually worn on the left and right ears of the user.
- the first electronic device 100 may determine the coordinate information of the above-mentioned first microphone array by using the face image collected by the front camera and its own posture information.
- the first electronic device 100 can determine the coordinates of each microphone in the first microphone array in the coordinate system of the first electronic device.
- the first electronic device 100 may first determine the transformation relationship between the three-dimensional (3D) human head coordinate system and the first electronic device coordinate system according to the face image. Then, by converting the coordinates of the user's left and right outer auricles in the 3D human head coordinate system into coordinates in the first electronic device coordinate system, the first electronic device 100 may determine the coordinates, in the first electronic device coordinate system, of the TWS headset microphones in the first microphone array.
- the above-mentioned 3D human head coordinate system may be determined according to a standard human head model. Each key point on the human face can correspond to a certain coordinate in the 3D human head coordinate system.
- the 3D human head coordinate system x_h-y_h-z_h may be established as a three-dimensional coordinate system by taking the position of the nose tip of the standard human head as the origin, the direction perpendicular to the face as the x-axis, the horizontal direction parallel to the face as the y-axis, and the vertical direction parallel to the face as the z-axis.
- the first electronic device 100 may store data related to the 3D human head coordinate system x_h-y_h-z_h. This embodiment of the present application does not limit the method for establishing the above-mentioned 3D human head coordinate system.
- the first electronic device 100 can determine the extrinsic parameter matrix through the relationship between the coordinates of the key points of the face image in the pixel coordinate system and the coordinates of the corresponding key points in the 3D human head coordinate system.
- the above pixel coordinate system is a two-dimensional coordinate system, which can be used to reflect the arrangement of pixels in an image.
- the pixel coordinate system may be a two-dimensional coordinate system established with any pixel of the image as the origin and the directions parallel to both sides of the image plane as the directions of the x-axis and the y-axis.
- the above extrinsic parameter matrix can be used to describe the transformation relationship between the 3D human head coordinate system x_h-y_h-z_h and the first electronic device coordinate system x_d-y_d-z_d.
- the face image collected by the front camera of the first electronic device 100 may include N key points.
- the N key points are p_1, p_2, ..., p_i, ..., p_N, respectively, where i is a positive integer less than or equal to N.
- the N key points may be any N key points in the area of the face image where the face is located, such as key points in the forehead area, key points in the cheek area, and key points in the lip area.
- the embodiments of the present application do not limit the number of the above-mentioned key points.
- the above key point detection algorithm may be, for example, a method of determining the coordinates of key points by using a trained neural network model. The embodiments of the present application do not limit the above key point detection algorithm.
- the coordinates of the above-mentioned N key points in the pixel coordinate system and the coordinates in the 3D human head coordinate system may have the relationship in the following formula (1):
- C is the internal parameter matrix.
- the internal parameter matrix can be used to describe the transformation relationship between the pixel coordinate system and the first electronic device coordinate system x_d-y_d-z_d.
- the internal parameter matrix is only related to the parameters of the camera of the first electronic device 100 .
- the internal parameter matrix can be obtained by the method of camera calibration. For the specific implementation method of the above camera calibration, reference may be made to the camera calibration method in the prior art, which will not be repeated here.
- the above-mentioned internal parameter matrix C may be stored in the first electronic device 100 .
- R is the rotation matrix.
- T is the center offset vector. The above R and T together form the extrinsic parameter matrix [R | T].
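- For reference, under the standard pinhole camera model the relationship among C, R and T is commonly written in the form below. Here (u_i, v_i) denotes the pixel coordinates of key point i, (x_{h,i}, y_{h,i}, z_{h,i}) denotes its coordinates in the 3D human head coordinate system, and s_i is a scale factor. These symbols are assumptions chosen for illustration and may differ from the notation of the present application.

```latex
s_i \begin{bmatrix} u_i \\ v_i \\ 1 \end{bmatrix}
  = C \, [\, R \mid T \,]
    \begin{bmatrix} x_{h,i} \\ y_{h,i} \\ z_{h,i} \\ 1 \end{bmatrix},
  \qquad i = 1, \dots, N
```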
- the first electronic device 100 may solve the above-mentioned extrinsic parameter matrix according to the coordinates of the above-mentioned N key points in the pixel coordinate system and their coordinates in the 3D human head coordinate system, where:
- the above α, β, and γ may be the deflection angles between the corresponding coordinate axes of the 3D human head coordinate system x_h-y_h-z_h and the first electronic device coordinate system x_d-y_d-z_d, respectively.
- x, y, and z in the above-mentioned center offset vector T may be the offsets between the two coordinate systems along the x_d axis, the y_d axis, and the z_d axis, respectively.
- the above-mentioned extrinsic parameter matrix may be stored in the first electronic device 100 .
- the first electronic device 100 may determine the coordinates of the key points where the outer auricles of the left ear and the right ear are located in the above-mentioned 3D human head coordinate system.
- the first electronic device 100 may use the above h_L and h_R as the coordinates of the second microphone 230 of the left-ear TWS earphone 201 and the third microphone 220 of the right-ear TWS earphone 202 in the above-mentioned 3D human head coordinate system, respectively.
- the first electronic device 100 can convert the coordinates of the second microphone 230 and the third microphone 220 in the above 3D human head coordinate system into coordinates in the first electronic device coordinate system.
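- The conversion from the 3D human head coordinate system to the first electronic device coordinate system can be sketched as a rigid transform. The following is a minimal illustration only: the rotation order (about x, then y, then z), the use of radians, and the function names are assumptions, since the text does not fix a convention for how the three deflection angles are composed.

```python
import numpy as np


def rotation_matrix(alpha: float, beta: float, gamma: float) -> np.ndarray:
    """Build a 3x3 rotation from angles (radians) about the x, y and z axes.

    Assumed convention: R = Rz(gamma) @ Ry(beta) @ Rx(alpha).
    """
    ca, sa = np.cos(alpha), np.sin(alpha)
    cb, sb = np.cos(beta), np.sin(beta)
    cg, sg = np.cos(gamma), np.sin(gamma)
    rx = np.array([[1.0, 0.0, 0.0], [0.0, ca, -sa], [0.0, sa, ca]])
    ry = np.array([[cb, 0.0, sb], [0.0, 1.0, 0.0], [-sb, 0.0, cb]])
    rz = np.array([[cg, -sg, 0.0], [sg, cg, 0.0], [0.0, 0.0, 1.0]])
    return rz @ ry @ rx


def head_to_device(h: np.ndarray, R: np.ndarray, T: np.ndarray) -> np.ndarray:
    """Map a point h from head coordinates to device coordinates: R h + T."""
    return R @ h + T
```

- Applying `head_to_device` to the head-frame coordinates of the two outer auricles would give the coordinates of the second and third microphones in the device frame under these assumptions.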
- the first electronic device 100 may store the coordinate e_1, in the first electronic device coordinate system, of the first microphone 211 described in the foregoing embodiment.
- the first electronic device 100 can determine the transformation relationship between the coordinate system of the first electronic device and the world coordinate system, and calculate the coordinates of each microphone in the first microphone array in the world coordinate system.
- the pose of the first electronic device 100 is unknown and time-varying.
- the target sound to be collected by the first microphone array not only includes the user's voice, but also includes the user's sound of playing a musical instrument.
- Using the coordinates of each microphone in the first microphone array in the coordinate system of the first electronic device in the process of spatial filtering by the first electronic device will reduce the effect of spatial filtering.
- the first electronic device 100 may convert the coordinates of each microphone in the first microphone array in the coordinate system of the first electronic device into coordinates in the world coordinate system.
- the first electronic device 100 may acquire the posture information of the first electronic device 100 through the posture sensor 214 .
- the posture information may include the deflection angles α', β', and γ' between the coordinate axes of the first electronic device coordinate system x_d-y_d-z_d and the world coordinate system x_w-y_w-z_w.
- E' and E can have the relationship shown in the following formula (4):
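- As an illustration only, suppose E is a matrix whose columns are the microphone coordinates in the first electronic device coordinate system, E' holds the corresponding coordinates in the world coordinate system, and R' is the rotation matrix determined by the posture angles. Under that assumed reading, which is not confirmed by the text here, a pure-rotation relationship between the two would take the form:

```latex
E' = R' \, E
```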
- the above process of determining the coordinate information of the first microphone array may be completed by the coordinate calculation module 216C shown in FIG. 4 .
- the first electronic device 100 may perform spatial filtering on the joint microphone array signal.
- the first electronic device 100 may first perform voice activity detection on the above-mentioned joint microphone array signal. Voice activity detection can be used to distinguish, within the joint microphone array signal, the signal at frequency points belonging to the target sound from the signal at frequency points belonging to the ambient noise. Further, the first electronic device 100 can update the noise spatial characteristic according to the signal at the frequency points belonging to the ambient noise.
- the first electronic device 100 can estimate the target steering vector according to the signal at the frequency points belonging to the target sound. Both the noise spatial characteristic and the target steering vector described above are parameters used to determine the spatial filter.
- updating the noise spatial characteristic can reduce the influence of the target sound on the spatial filter's suppression of ambient noise.
- the target steering vector estimation can be used to reduce the influence of the environmental noise at the same frequency as the target sound on the environmental noise suppression of the spatial filter, and to improve the effect of the spatial filter.
- VAD Voice activity detection
- the sound collected by the first microphone array in the near-field area mainly includes the user's voice and the sound of the user playing a musical instrument. That is, the target sound signals to be enhanced in the above-mentioned joint microphone array signal mainly include voice signals and sound signals of musical instruments. Based on the difference between the characteristics of voice signals and musical-instrument sound signals and the characteristics of environmental noise signals, the first electronic device 100 can use a neural network model to perform voice activity detection on the joint microphone array signal, so as to distinguish, within the joint microphone array signal, the target sound signal at frequency points belonging to the target sound from the ambient noise signal at frequency points belonging to the ambient noise.
- the above-mentioned neural network model for voice activity detection can be obtained by training on voice signals and sound signals of musical instruments.
- the trained neural network model can be used to distinguish the target sound signal from the ambient noise signal in the joint microphone array signal. For example, when a voice signal or a musical-instrument sound signal, whose frequency points belong to the voice or to the sound of a musical instrument, is input into the trained neural network model, the model can output the label 1.
- the above label 1 may indicate that the input received by the trained neural network model is a target sound signal whose frequency points belong to the target sound. When an environmental noise signal whose frequency points belong to the environmental noise is input into the trained neural network model, the model can output the label 0.
- the above label 0 may indicate that the input received by the trained neural network model is not a target sound signal (e.g., it is an environmental noise signal).
- the first electronic device 100 may store the above trained neural network model.
- the embodiments of the present application do not limit the above method for training the neural network model.
- for the training method of the neural network model, reference may be made to specific implementations in the prior art, which will not be repeated here.
- the above-mentioned neural network model may be a convolutional neural network model, a deep neural network model, or the like.
- the embodiment of the present application does not limit the type of the neural network model.
- the first electronic device 100 may use the above-mentioned joint microphone array signal and the coordinate information of the first microphone array as the input of the above-mentioned trained neural network model.
- the trained neural network model can adapt to the influence of the coordinate information change of the first microphone array on the voice activity detection.
- the trained neural network model can better distinguish the target sound signal and the environmental noise signal in the signal of the joint microphone array.
- the first electronic device 100 can use the detection result of the above voice activity detection to update the noise spatial characteristic at each frequency point in the joint microphone array signal, so as to reduce the interference of the target sound signal on the noise spatial characteristic of the ambient noise.
- the noise spatial characteristic can be represented by a noise covariance matrix.
- the first electronic device 100 may update the noise covariance matrix according to the detection result of the voice activity detection. Specifically, the short-time Fourier transform (STFT) of the joint microphone array signal at time t and frequency point f is X_t(f). The noise covariance matrix at the moment preceding time t is R_{t-1}(f). The first electronic device 100 performs voice activity detection on the joint microphone array signal, and the detection result at frequency point f is vad(f). The first electronic device 100 may update the noise covariance matrix at time t according to the following formula (5):
- R_t(f) is the noise covariance matrix at time t.
- fac is the smoothing factor. fac is greater than or equal to 0 and less than or equal to 1.
- the first electronic device 100 can smoothly update the noise spatial characteristic according to whether the joint microphone array signal at frequency point f is a target sound signal or an environmental noise signal. This can reduce the influence of sudden changes in ambient noise, and of the target sound, on the suppression of ambient noise by the spatial filter.
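- A smoothed, VAD-gated update of this kind can be sketched as follows. The exact form of formula (5) is not shown in the text here, so the recursion below is an assumption: it is one common recursive-averaging scheme in which frequency points labelled as noise (vad = 0) are updated with smoothing factor fac, and frequency points labelled as target sound (vad = 1) keep their previous covariance. The function name and array layout are hypothetical.

```python
import numpy as np


def update_noise_covariance(R_prev: np.ndarray, X: np.ndarray,
                            vad: np.ndarray, fac: float) -> np.ndarray:
    """Update per-frequency noise covariance matrices for one STFT frame.

    R_prev: shape (F, M, M), complex, covariance from the previous frame
    X:      shape (F, M), complex, STFT of the M microphones at this frame
    vad:    shape (F,), 1 = target sound, 0 = environmental noise
    fac:    smoothing factor in [0, 1]
    """
    if not 0.0 <= fac <= 1.0:
        raise ValueError("fac must lie in [0, 1]")
    # Instantaneous outer product X X^H for every frequency point.
    inst = X[:, :, None] * X[:, None, :].conj()
    smoothed = fac * R_prev + (1.0 - fac) * inst
    # Only frequency points detected as noise are allowed to change.
    is_noise = (np.asarray(vad) == 0)[:, None, None]
    return np.where(is_noise, smoothed, R_prev)
```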
- the first electronic device 100 may also adopt other methods to update the above-mentioned noise space characteristic.
- the first electronic device 100 may estimate the target steering vector according to the coordinate information of the first microphone array and the sound propagation model.
- the target steering vector can be used to indicate the direction of the target sound signal.
- the target steering vector may be determined by different delay times of the target sound signal reaching each microphone in the first microphone array.
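- A delay-based steering vector can be sketched as follows. This is an illustration under stated assumptions: a single point source at a known position in the same coordinate system as the microphones, free-field propagation at 343 m/s, and phases referenced to the first microphone. The function name, the source position argument and the speed-of-sound constant are not taken from the present application.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # metres per second, assumed room-temperature value


def steering_vector(mic_positions, source_position, freq_hz: float) -> np.ndarray:
    """Return the steering vector at one frequency for a point source.

    mic_positions:   array-like of shape (M, 3), microphone coordinates in metres
    source_position: array-like of shape (3,), source coordinates in metres
    """
    mics = np.asarray(mic_positions, dtype=float)
    src = np.asarray(source_position, dtype=float)
    # Propagation distance from the source to each microphone.
    dist = np.linalg.norm(mics - src, axis=1)
    # Arrival-time differences relative to the first microphone.
    delays = (dist - dist[0]) / SPEED_OF_SOUND
    # A delay of tau seconds appears as a phase factor exp(-j 2 pi f tau).
    return np.exp(-2j * np.pi * freq_hz * delays)
```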
- the first electronic device 100 may further utilize the subspace projection method to improve the accuracy of the above-mentioned target steering vector. Improving the accuracy of the target steering vector helps the first electronic device 100 to more accurately distinguish the direction of the target sound signal and the direction of the ambient noise signal.
- the above estimation of the target steering vector can be used to reduce the influence of the environmental noise at the same frequency as the target sound on the environmental noise suppression of the spatial filter, and to improve the effect of the spatial filter.
- Spatial filtering can be used to process multi-channel microphone signals (i.e. joint microphone array signals), suppress signals in non-target directions (i.e. ambient noise signals) and enhance signals in target directions (i.e. target sound signals).
- the first electronic device 100 can use a minimum variance distortionless response (MVDR) beamforming algorithm, a linearly constrained minimum variance (LCMV) beamforming algorithm, a generalized sidelobe canceller (GSC), or other methods to determine the spatial filter.
- the embodiments of the present application do not limit the specific method for determining the spatial filter.
- the first electronic device 100 may use the MVDR beamforming algorithm to determine the spatial filter.
- the principle of this method is to select filter parameters that minimize the average output power for the joint microphone array signal, under the constraint that the desired signal is not distorted.
- the first electronic device 100 may determine the optimal spatial filter by using the above-mentioned noise spatial characteristic and the target steering vector.
- the optimal spatial filter can minimize the influence of the environmental noise signal under the constraint that the target steering vector passes through without distortion.
- the spatial filter can be designed according to the following formula (6):
- w(f) is the optimal filtering weight coefficient of the spatial filter.
- R_t(f) and a_t(f) are the noise covariance matrix and the target steering vector at frequency point f at time t, respectively.
- the first electronic device 100 may input the joint microphone array signal into the optimal spatial filter to perform spatial filtering. After spatial filtering, the first electronic device 100 can obtain the resulting audio signal.
- the resulting audio signal is an audio signal obtained by enhancing the target sound signal in the joint microphone array signal and suppressing the environmental noise.
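- For one frequency point, the MVDR filter and its application can be sketched as below. The sketch uses the standard textbook MVDR solution w = R^{-1} a / (a^H R^{-1} a), which is assumed here to correspond to the design described above; the function names are hypothetical, and the covariance matrix is assumed to be invertible.

```python
import numpy as np


def mvdr_weights(R: np.ndarray, a: np.ndarray) -> np.ndarray:
    """Standard MVDR weights for one frequency point.

    R: shape (M, M), noise covariance matrix (Hermitian, invertible)
    a: shape (M,), target steering vector
    """
    # Solve R x = a instead of forming the inverse explicitly.
    Rinv_a = np.linalg.solve(R, a)
    # Normalise so that w^H a = 1 (distortionless constraint).
    return Rinv_a / (a.conj() @ Rinv_a)


def apply_filter(w: np.ndarray, X: np.ndarray) -> complex:
    """Filter one multichannel STFT bin X of shape (M,): y = w^H X."""
    return w.conj() @ X
```

- Because of the normalisation, a signal arriving exactly along the steering vector passes with unit gain, while the weights otherwise minimise the output power contributed by the noise described by R.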
- the above process of performing spatial filtering may be completed by the spatial filtering module 216D shown in FIG. 4 .
- the following describes a method for enabling the first microphone array provided by an embodiment of the present application.
- the first microphone array may include a first microphone in the first electronic device 100 , a second microphone in the left TWS headset 201 , and a third microphone in the right TWS headset 202 .
- both the left ear TWS earphone 201 and the right ear TWS earphone 202 establish a Bluetooth connection with the first electronic device 100 . It is not limited to the connection through Bluetooth, and the TWS headset and the first electronic device 100 can also establish a communication connection through other communication methods.
- FIG. 7A exemplarily shows a flowchart of a method for enabling the first microphone array. As shown in FIG. 7A, the method may include steps S101-S110, as follows:
- the first electronic device 100 receives a user operation for enabling video recording.
- the user may use the first electronic device 100 to record video in a scenario that uses the front camera, such as live streaming or shooting a vlog.
- the first electronic device 100 may receive a user operation to start recording.
- the above-mentioned user operation for starting recording may be, for example, the user operation acting on the recording start control 326 as shown in FIG. 3F .
- the front camera of the first electronic device 100 is in an on state before receiving the above-mentioned user operation to start recording.
- the first electronic device 100 may display an image captured by the front camera in real time in the preview frame 323 shown in FIG. 3F.
- the left ear TWS earphone 201 performs wearing detection.
- the right ear TWS earphone 202 performs wearing detection.
- both the left ear TWS earphone 201 and the right ear TWS earphone 202 can include a wearing detection module.
- the wearing detection module can be used to detect whether the user wears the TWS headset.
- the above-mentioned wearing detection module may include a temperature sensor.
- the TWS earphone can obtain the temperature of the earpiece surface of the TWS earphone through the temperature sensor.
- the TWS earphone can determine that the user is wearing the TWS earphone.
- the TWS headset can perform wearing detection through the temperature sensor.
- the TWS headset can wake up the main processor to realize functions such as playing music and collecting sounds.
- the TWS headset can control the main processor and components such as speakers and microphones to be in a sleep state. This saves the power consumption of the TWS headset.
- the TWS headset can also perform wearing detection through a proximity light sensor, a motion sensor, a pressure sensor, and the like.
- the embodiments of the present application do not limit the method for detecting the wearing of the TWS headset.
- the first electronic device 100 determines that the user has worn the right ear TWS headset 202 .
- the first electronic device 100 determines that the user has worn the left ear TWS earphone 201 .
- the first electronic device 100 may send a message to the left TWS earphone 201 and the right TWS earphone 202 to inquire about the wearing detection result.
- the left-ear TWS earphone 201 and the right-ear TWS earphone 202 can send the wearing detection result to the first electronic device 100 .
- the first electronic device 100 may determine that the user has worn the left-ear TWS earphone 201 and the right-ear TWS earphone 202 according to the received wearing detection result.
- This embodiment of the present application does not limit the execution order of the foregoing steps S104 and S105.
- the first electronic device 100 turns on the first microphone.
- the first electronic device 100 can turn on the first microphone to collect sound.
- the first electronic device 100 sends an instruction to turn on the microphone to the right ear TWS earphone 202 .
- the first electronic device 100 sends an instruction to turn on the microphone to the left TWS earphone 201 .
- the first electronic device 100 may also send an instruction to turn on the microphone to the left-ear TWS earphone 201 .
- the first electronic device 100 may also send an instruction to turn on the microphone to the right-ear TWS earphone 202 .
- This embodiment of the present application does not limit the execution order of the foregoing step S106, step S107, and step S108.
- the left TWS earphone 201 turns on the second microphone.
- the left TWS earphone 201 can turn on the second microphone.
- the right ear TWS earphone 202 turns on the third microphone.
- the right-ear TWS earphone 202 can turn on the third microphone.
- the microphones used for sound collection may include the first microphone, the second microphone, and the third microphone. That is, the first microphone is turned on. Not limited to the above video recording scenario, the scenario where the first microphone is turned on may also be a scenario where the user is wearing the left TWS earphone 201 and the right TWS earphone 202 and uses the first electronic device 100 to make a video call.
- the left-ear TWS earphone 201 may turn on the second microphone when the wearing detection is performed and it is determined that the left-ear TWS earphone 201 is in the in-ear state.
- the right-ear TWS earphone 202 may turn on the third microphone when the wearing detection is performed and it is determined that the right-ear TWS earphone 202 is in the in-ear state. That is, the first electronic device 100 may not need to send an instruction to turn on the microphone to the left TWS earphone 201 and the right TWS earphone 202 .
- FIG. 7B exemplarily shows a flow chart of a method for sound collection.
- the method may include steps S201-S210.
- steps S201 to S207 are the process of sound collection performed by the first microphone array.
- steps S208 to S210 are processes in which the first electronic device 100 processes the audio signal collected by the first microphone array.
- the first microphone array performs sound collection.
- the first electronic device 100 receives a user operation to start recording.
- for the above-mentioned user operation to start recording, reference may be made to the description of step S101 in FIG. 7A, which will not be repeated here.
- the first electronic device 100 emits an alignment sound.
- the left ear TWS earphone 201 collects a second audio signal through the second microphone, where the second audio signal includes an alignment signal, a target sound signal from the user, and an environmental noise signal.
- the right ear TWS earphone 202 collects a third audio signal through the third microphone, where the third audio signal includes an alignment signal, a target sound signal from the user, and an environmental noise signal.
- the first electronic device 100 collects a first audio signal through the first microphone, where the first audio signal includes an alignment signal, a target sound signal from a user, and an environmental noise signal.
- the sounds near the first electronic device 100, the left TWS earphone 201 and the right TWS earphone 202 may include the alignment sound, the user's voice, the sound of the user playing a musical instrument, and ambient noise.
- the target sound collected by the first microphone array includes the voice of the user and the sound of the user playing a musical instrument.
- the above target sound is the sound that is expected to be retained in the recorded video and is recorded more clearly.
- the above-mentioned ambient noises are sounds that are not expected to be recorded.
- the first microphone, the second microphone, and the third microphone can all collect sounds near the above-mentioned first electronic device 100 , the left TWS earphone 201 , and the right TWS earphone 202 .
- the sound collected by the first microphone, the second microphone, and the third microphone can each be processed by a related audio-processing module (such as an ADC) to obtain the first audio signal, the second audio signal, and the third audio signal respectively.
- the left TWS earphone 201 may send the second audio signal to the first electronic device 100 through Bluetooth.
- the right ear TWS earphone 202 may send the third audio signal to the first electronic device 100 through Bluetooth.
- the first electronic device 100 processes the audio signal collected by the first microphone array.
- the first electronic device 100 performs time delay alignment on the first audio signal, the second audio signal, and the third audio signal to obtain a joint microphone array signal.
- the first electronic device 100 determines the coordinate information of the first microphone array in combination with the face image collected by the front camera and the posture information of the first electronic device 100 .
- the first electronic device 100 uses a spatial filter to perform spatial filtering on the joint microphone array signal according to the coordinate information of the first microphone array.
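The time delay alignment step can be sketched as follows, assuming (this section does not spell it out) that the known alignment signal is located in each microphone channel by cross-correlation and that the channels are then shifted to a common starting sample. The function names are hypothetical, and the unnormalised correlation used here is only adequate when the alignment signal dominates the recording.

```python
from typing import List, Sequence


def find_offset(signal: Sequence[float], reference: Sequence[float]) -> int:
    """Return the sample index where `reference` best matches `signal` (max cross-correlation)."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(len(signal) - len(reference) + 1):
        score = sum(signal[lag + i] * r for i, r in enumerate(reference))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def align_channels(channels: List[Sequence[float]],
                   reference: Sequence[float]) -> List[List[float]]:
    """Shift every channel so the alignment signal starts at the same sample index."""
    offsets = [find_offset(ch, reference) for ch in channels]
    earliest = min(offsets)
    # Drop the extra leading samples of the later channels, then trim to a common length.
    shifted = [list(ch[off - earliest:]) for ch, off in zip(channels, offsets)]
    length = min(len(ch) for ch in shifted)
    return [ch[:length] for ch in shifted]
```

The list returned by `align_channels` plays the role of the joint microphone array signal: one row per microphone, all rows sharing the same time origin.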
- the first electronic device 100 may mix the resultant audio signal with the video captured by the camera during the process from the start of recording to the end of recording.
- the first electronic device 100 can determine the time delay between the resultant audio signal and the video according to the time when any one or more microphones in the first microphone array start to collect sound and the time when the camera of the first electronic device 100 starts to collect images. According to this time delay, the first electronic device 100 can ensure that the resultant audio signal and the video are time-aligned when mixing them.
- the embodiments of the present application do not limit the method for the first electronic device 100 to perform time delay alignment processing on the resultant audio signal and video.
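One simple way to use the two start times is sketched below. It is only an assumption about how the alignment might be done: the audio track is padded or trimmed at its head so that its first sample coincides with the first video frame. The function names and the millisecond timestamps are hypothetical.

```python
from typing import List, Sequence


def av_delay_ms(audio_start_ms: int, video_start_ms: int) -> int:
    """Positive result: audio capture started after video capture by that many milliseconds."""
    return audio_start_ms - video_start_ms


def align_audio_to_video(audio: Sequence[float], audio_start_ms: int,
                         video_start_ms: int, sample_rate_hz: int) -> List[float]:
    """Pad or trim the head of the audio so sample 0 coincides with the first video frame."""
    delay = av_delay_ms(audio_start_ms, video_start_ms)
    n = abs(delay) * sample_rate_hz // 1000
    if delay > 0:
        # Audio started late: prepend silence for the missing interval.
        return [0.0] * n + list(audio)
    # Audio started early (or on time): drop samples recorded before the first frame.
    return list(audio[n:])
```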
- the first electronic device 100 may save the mixed audio and video data locally or upload it to a cloud server.
- the first electronic device 100 may determine the coordinate information of the first microphone array by using the face image collected within a preset time period after the recording starts and the posture information of the first electronic device 100 .
- the first electronic device 100 may store the coordinate information of the first microphone array, and use the coordinate information of the first microphone array when performing spatial filtering on the joint microphone array signal in the video recording process. That is to say, the first electronic device 100 does not need to repeatedly measure the coordinate information of the first microphone array in one video recording process. In this way, the power consumption of the first electronic device 100 can be saved.
- the distance and direction between the user wearing the TWS headset and the first electronic device 100 generally do not change much.
- the first electronic device 100 uses the coordinate information of the first microphone array determined within the preset time period after the start of recording as the coordinate information of the first microphone array in this recording process, which has little effect on enhancing the target sound and suppressing environmental noise.
- the first electronic device 100 may determine the coordinate information of the first microphone array by using the face image collected within a preset period of time after the recording starts and the posture information of the first electronic device 100. In the subsequent stage of this video recording, the first electronic device 100 may check the distance and direction between the user and the first electronic device 100 at regular intervals. If it is determined that the variation of the distance and direction between the user and the first electronic device 100 exceeds the preset variation, the first electronic device 100 can re-determine the coordinate information of the first microphone array according to the face image currently collected by the front camera and the current posture information of the first electronic device 100. Further, the first electronic device 100 may perform spatial filtering on the joint microphone array signal by using the re-determined coordinate information of the first microphone array.
- otherwise, the first electronic device 100 can continue to use the currently stored coordinate information of the first microphone array to perform spatial filtering on the joint microphone array signal. The above method not only reduces the number of times the first electronic device 100 determines the coordinate information of the first microphone array, which saves the power consumption of the first electronic device 100, but also reduces the influence of changes in the coordinate information of the first microphone array on spatial filtering during the video recording process.
- the first electronic device 100 may determine changes in the distance and direction between the user and the first electronic device 100 by detecting changes in the size and position of the face frame in the image captured by the front camera.
- the first electronic device 100 may also determine the amount of change in the distance and direction between the user and the first electronic device 100 by means of a proximity light sensor, sound wave ranging, or the like. This embodiment of the present application does not limit this.
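The periodic check described above can be sketched with the face frame as the only input. The `FaceBox` type, the normalised coordinates, and both thresholds are hypothetical; the application only states that changes in the size and position of the face frame are compared against a preset variation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FaceBox:
    cx: float      # face-frame centre x, normalised to image width
    cy: float      # face-frame centre y, normalised to image height
    height: float  # face-frame height, normalised to image height


# Hypothetical preset variations: not taken from the application.
SIZE_CHANGE_LIMIT = 0.2  # relative change in face-frame height (proxy for distance)
SHIFT_LIMIT = 0.1        # shift of the face-frame centre (proxy for direction)


def needs_recalibration(stored: FaceBox, current: FaceBox) -> bool:
    """Return True when the microphone array coordinates should be re-determined."""
    size_change = abs(current.height - stored.height) / stored.height
    shift = max(abs(current.cx - stored.cx), abs(current.cy - stored.cy))
    return size_change > SIZE_CHANGE_LIMIT or shift > SHIFT_LIMIT
```

When `needs_recalibration` returns False, the stored coordinate information would simply be reused for the next interval.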
- when the first electronic device 100 records a video while a communication connection is established with the TWS headset, it can still turn on its own microphone to collect sound.
- the microphone of the first electronic device 100 and the microphone of the TWS headset may form a first microphone array.
- the near-field area formed by the first microphone array includes the user wearing the TWS headset and the area where the first electronic device 100 is located.
- the size of the first microphone array is larger and its spatial resolution ability is stronger, so it can more accurately distinguish the target sound from the user in the near-field area from the ambient noise from the far-field area.
- when the first electronic device 100 performs spatial filtering on the audio signal collected by the first microphone array, the target sound can be better enhanced, the environmental noise can be suppressed, and the sound quality of the recorded video can be improved.
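This section does not name the spatial filter. As a stand-in, the sketch below uses a basic delay-and-sum beamformer, a standard technique that is not necessarily the one used in the application. It assumes the microphone coordinates and the position of the target sound source are known in metres, and rounds the steering delays to whole samples. All names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

SPEED_OF_SOUND_M_S = 343.0
Point = Tuple[float, float, float]


def steering_delays(mics: Sequence[Point], source: Point,
                    sample_rate_hz: int) -> List[int]:
    """Per-microphone delay, in whole samples, relative to the microphone closest to the source."""
    dists = [math.dist(m, source) for m in mics]
    nearest = min(dists)
    return [round((d - nearest) / SPEED_OF_SOUND_M_S * sample_rate_hz) for d in dists]


def delay_and_sum(channels: Sequence[Sequence[float]],
                  delays: Sequence[int]) -> List[float]:
    """Advance each channel by its delay and average, reinforcing sound from the steered position."""
    length = min(len(ch) - d for ch, d in zip(channels, delays))
    return [
        sum(ch[n + d] for ch, d in zip(channels, delays)) / len(channels)
        for n in range(length)
    ]
```

Sound arriving from the steered position adds coherently across channels, while sound from other positions is averaged down, which is the enhancement and suppression effect the text describes.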
- adding a microphone in the first electronic device 100 to the microphone array can reduce the influence of the user's posture of wearing the TWS headset on the first electronic device 100 to enhance the target sound and reduce environmental noise.
- the sound collection method provided by the present application is particularly applicable to a scenario where a user wearing a TWS headset uses the first electronic device 100 to live broadcast, shoot a vlog, etc. to perform video recording. Not limited to the above scenarios of video recording, the sound collection method provided in this application can also be applied to other scenarios such as video calls.
- the microphone of one of the left ear TWS earphone 201 and the right ear TWS earphone 202 may form a microphone array with the microphone of the first electronic device 100.
- the microphone array can also be applied to the sound collection method provided in the embodiment of the present application.
- the first microphone of the first electronic device 100 and the second microphone of the left TWS headset 201 may perform sound collection.
- the first electronic device 100 can obtain the first audio signal through the first microphone.
- the left-ear TWS earphone 201 can obtain the second audio signal through the second microphone.
- the left ear TWS earphone 201 can send the second audio signal to the first electronic device 100 .
- the first electronic device 100 may perform noise reduction processing on the first audio signal and the second audio signal to obtain the audio signal in the live video.
- the near-field area of the microphone array formed by the microphone of the above-mentioned one earphone and the microphone of the first electronic device 100 may still include the location of the sound source of the target sound, such as the user's voice and the sound of the user playing a musical instrument. Also, using the microphone of only one earphone can save the power consumption of the headset.
- the image captured by the camera of the first electronic device 100 includes multiple human faces. That is, multiple users jointly use the first electronic device 100 to record video. One of the multiple users is wearing a TWS headset. A communication connection is established between the TWS headset and the first electronic device 100 .
- the near-field area of the first microphone array formed by the microphone of the first electronic device 100 and the microphone of the TWS headset may generally cover the areas where the multiple users are located. Then, the sound in the near-field area of the first microphone array may include both the voices of the multiple users and the sounds of the multiple users playing musical instruments.
- when the first electronic device 100 performs spatial filtering on the joint microphone array signal collected by the first microphone array, it can enhance not only the voice of the user wearing the TWS headset and the sound of that user playing a musical instrument, but also the voices of the other users in the recording and the sound of those users playing musical instruments. This can improve the sound quality of videos recorded by multiple people.
- FIG. 8 exemplarily shows a schematic structural diagram of a first electronic device 100 .
- the first electronic device 100 may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, a headphone jack 170D, a sensor module 180, a button 190, a motor 191, an indicator 192, a camera 193, a display screen 194, a subscriber identification module (SIM) card interface 195, and the like.
- the sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, etc.
- the structures illustrated in the embodiments of the present invention do not constitute a specific limitation on the first electronic device 100 .
- the first electronic device 100 may include more or less components than shown, or some components are combined, or some components are separated, or different components are arranged.
- the illustrated components may be implemented in hardware, software, or a combination of software and hardware.
- the processor 110 may include one or more processing units. For example, the processor 110 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a memory, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU). Different processing units may be independent devices, or may be integrated in one or more processors.
- a memory may also be provided in the processor 110 for storing instructions and data.
- the memory in the processor 110 is a cache memory. This memory may hold instructions or data that the processor 110 has just used or uses repeatedly. If the processor 110 needs to use the instructions or data again, it can fetch them directly from this memory. This avoids repeated accesses and reduces the waiting time of the processor 110, thereby increasing the efficiency of the system.
- the charging management module 140 is used to receive charging input from the charger.
- the power management module 141 is used for connecting the battery 142 , the charging management module 140 and the processor 110 .
- the power management module 141 receives input from the battery 142 and/or the charging management module 140 and supplies power to the processor 110 , the internal memory 121 , the external memory, the display screen 194 , the camera 193 , and the wireless communication module 160 .
- the wireless communication function of the first electronic device 100 may be implemented by the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, the modulation and demodulation processor, the baseband processor, and the like.
- Antenna 1 and Antenna 2 are used to transmit and receive electromagnetic wave signals.
- Each antenna in the first electronic device 100 may be used to cover a single or multiple communication frequency bands.
- the mobile communication module 150 may provide a wireless communication solution including 2G/3G/4G/5G etc. applied on the first electronic device 100 .
- the mobile communication module 150 can receive electromagnetic waves from the antenna 1, filter and amplify the received electromagnetic waves, and transmit them to the modulation and demodulation processor for demodulation.
- the mobile communication module 150 can also amplify the signal modulated by the modulation and demodulation processor, and then turn it into an electromagnetic wave for radiation through the antenna 1 .
- the modem processor may include a modulator and a demodulator.
- the modulator is used to modulate the low frequency baseband signal to be sent into a medium and high frequency signal.
- the demodulator is used to demodulate the received electromagnetic wave signal into a low frequency baseband signal.
- the modem processor may be a stand-alone device. In other embodiments, the modem processor may be independent of the processor 110, and may be provided in the same device as the mobile communication module 150 or other functional modules.
- the wireless communication module 160 can provide wireless communication solutions applied on the first electronic device 100, including wireless local area network (WLAN) (such as a wireless fidelity (Wi-Fi) network), Bluetooth (BT), global navigation satellite system (GNSS), frequency modulation (FM), near field communication (NFC), infrared (IR) technology, and the like.
- the wireless communication module 160 may be one or more devices integrating at least one communication processing module.
- the wireless communication module 160 receives electromagnetic waves via the antenna 2 , frequency modulates and filters the electromagnetic wave signals, and sends the processed signals to the processor 110 .
- the wireless communication module 160 can also receive the signal to be sent from the processor 110, perform frequency modulation on it, amplify it, and convert it into electromagnetic waves for radiation through the antenna 2.
- the first electronic device 100 implements a display function through a GPU, a display screen 194, an application processor, and the like.
- the GPU is a microprocessor for image processing, and is connected to the display screen 194 and the application processor.
- the GPU is used to perform mathematical and geometric calculations for graphics rendering.
- Processor 110 may include one or more GPUs that execute program instructions to generate or alter display information.
- Display screen 194 is used to display images, videos, and the like.
- the first electronic device 100 may include 1 or N display screens 194 , where N is a positive integer greater than 1.
- the first electronic device 100 may implement a shooting function through an ISP, a camera 193, a video codec, a GPU, a display screen 194, an application processor, and the like.
- the ISP is used to process the data fed back by the camera 193 .
- the shutter is opened, the light is transmitted to the camera photosensitive element through the lens, the light signal is converted into an electrical signal, and the camera photosensitive element transmits the electrical signal to the ISP for processing, and converts it into an image visible to the naked eye.
- the ISP may be provided in the camera 193 .
- Camera 193 is used to capture still images or video.
- the object is projected through the lens to generate an optical image onto the photosensitive element.
- the photosensitive element may be a charge coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor.
- the photosensitive element converts the optical signal into an electrical signal, and then transmits the electrical signal to the ISP to convert it into a digital image signal.
- the ISP outputs the digital image signal to the DSP for processing.
- DSP converts digital image signals into standard RGB, YUV and other formats of image signals.
- the first electronic device 100 may include 1 or N cameras 193 , where N is a positive integer greater than 1.
- the aforementioned camera 213 shown in FIG. 4 is the front camera in the camera 193 in FIG. 8 .
- the image captured by the camera 213 can be converted into a digital image signal by the ADC 215C and output to the DSP.
- the aforementioned ADC215C may be an analog-to-digital converter integrated in the aforementioned ISP.
- DSP is used to process digital signals, such as digital image signals, digital audio signals, and other digital signals.
- the digital signal processor is used to perform Fourier transform on the energy of the frequency point, and the like.
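To make "Fourier transform on the energy of the frequency point" concrete, the sketch below evaluates a single discrete Fourier transform bin directly and returns its energy. This is a textbook definition rather than the DSP's actual routine, and the function name `bin_energy` and the bin index `k` are hypothetical.

```python
import cmath
import math
from typing import Sequence


def bin_energy(samples: Sequence[float], k: int) -> float:
    """Energy |X[k]|^2 of DFT bin k, computed directly from the DFT definition."""
    n = len(samples)
    x_k = sum(s * cmath.exp(-2j * math.pi * k * i / n) for i, s in enumerate(samples))
    return abs(x_k) ** 2
```

A production DSP would normally compute all bins at once with a fast Fourier transform; the direct sum is used here only because it is short and has no dependencies.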
- the DSP may include a signal generator, a delay alignment module, a coordinate calculation module, and a spatial filtering module.
- the functions of the above-mentioned signal generator, time-delay alignment module, coordinate calculation module, and spatial filtering module may be introduced with reference to the foregoing embodiment in FIG. 4 , and will not be repeated here.
- the above-mentioned signal generator, time delay alignment module, coordinate calculation module and spatial filtering module may also be integrated in other chip processors individually or jointly. This embodiment of the present application does not limit this.
- Video codecs are used to compress or decompress digital video.
- the first electronic device 100 may support one or more video codecs.
- the first electronic device 100 can play or record videos in various encoding formats, such as Moving Picture Experts Group (MPEG) 1, MPEG2, MPEG3, MPEG4, and so on.
- the NPU is a neural-network (NN) computing processor.
- Applications such as intelligent cognition of the first electronic device 100 can be implemented through the NPU, for example: image recognition, face recognition, speech recognition, voice activity detection, and the like.
- the external memory interface 120 can be used to connect an external memory card, such as a Micro SD card, to expand the storage capacity of the first electronic device 100.
- Internal memory 121 may be used to store computer executable program code, which includes instructions.
- the processor 110 executes various functional applications and data processing of the first electronic device 100 by executing the instructions stored in the internal memory 121 .
- the internal memory 121 may include a storage program area and a storage data area.
- the storage program area can store an operating system, an application program required for at least one function (such as a sound playback function, an image playback function, etc.), and the like.
- the storage data area may store data (such as audio data) and the like created during the use of the first electronic device 100 .
- the internal memory 121 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, universal flash storage (UFS), and the like.
- the first electronic device 100 may implement audio functions, such as music playback and recording, through an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, an application processor, and the like.
- the audio module 170 is used for converting digital audio signals into analog audio signals for output, and for converting analog audio inputs into digital audio signals. Audio module 170 may also be used to encode and decode audio signals. In some embodiments, the audio module 170 may be provided in the processor 110 , or some functional modules of the audio module 170 may be provided in the processor 110 .
- the speaker 170A, also referred to as a "loudspeaker", is used to convert audio electrical signals into sound signals.
- the user can listen to music or to a hands-free call through the speaker 170A of the first electronic device 100.
- the receiver 170B, also referred to as an "earpiece", is used to convert audio electrical signals into sound signals.
- the voice can be heard by placing the receiver 170B close to the human ear.
- the microphone 170C, also called a "mic" or "mike", is used to convert sound signals into electrical signals.
- the user can speak with the mouth close to the microphone 170C to input a sound signal into the microphone 170C.
- the first electronic device 100 may be provided with at least one microphone 170C.
- the first electronic device 100 may be provided with two microphones 170C, which may implement a noise reduction function in addition to collecting sound signals.
- the first electronic device 100 may further be provided with three, four or more microphones 170C to collect sound signals, reduce noise, identify sound sources, and implement directional recording functions.
- the above-mentioned microphone 170C is the first microphone 211 in the foregoing embodiment.
- the earphone jack 170D is used to connect wired earphones.
- the signal generator of the first electronic device 100 may generate the alignment signal.
- the alignment signal is a digital audio signal.
- the signal generator can send this alignment signal to the DAC215A.
- the DAC215A can convert this alignment signal to an analog audio signal.
- the above-mentioned DAC215A may be integrated in the above-mentioned audio module 170 .
- the sound collected by the first microphone 211 is an analog audio signal.
- the first microphone 211 can send the collected sound to the ADC 215B.
- the ADC215B can convert this analog audio signal to a digital audio signal.
- the above ADC215B may be integrated in the above audio module 170 .
- the pressure sensor 180A is used to sense pressure signals, and can convert the pressure signals into electrical signals.
- the pressure sensor 180A may be provided on the display screen 194 .
- the gyro sensor 180B may be used to determine the motion attitude of the first electronic device 100 .
- the angular velocity of the first electronic device 100 about three axes (i.e., the x, y and z axes) may be determined through the gyro sensor 180B.
- the gyro sensor 180B can be used for image stabilization. Exemplarily, when the shutter is pressed, the gyro sensor 180B detects the shaking angle of the first electronic device 100, calculates the distance that the lens module needs to compensate according to the angle, and allows the lens to offset the shaking of the first electronic device 100 through reverse motion, thereby achieving image stabilization.
- the gyro sensor 180B can also be used for navigation and somatosensory game scenarios.
- the air pressure sensor 180C is used to measure air pressure.
- the magnetic sensor 180D includes a Hall sensor.
- the first electronic device 100 can detect the opening and closing of the flip holster by using the magnetic sensor 180D.
- the acceleration sensor 180E can detect the magnitude of the acceleration of the first electronic device 100 in various directions (generally three axes).
- the magnitude and direction of gravity can be detected when the first electronic device 100 is stationary. It can also be used for recognizing the posture of the first electronic device 100, and can be used in applications such as switching between horizontal and vertical screens, and pedometers.
- the above-mentioned gyro sensor 180B and the above-mentioned acceleration sensor 180E may be the attitude sensor 214 in the foregoing embodiment.
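As a small illustration of how gravity readings from the acceleration sensor can indicate device posture, the sketch below classifies the screen orientation from the gravity components along two device axes. The axis convention (x along the short edge, y along the long edge) and the function name are assumptions, not taken from the application.

```python
def screen_orientation(ax: float, ay: float) -> str:
    """Classify orientation from the gravity components along the device x (short) and y (long) axes."""
    # When the device is held upright, gravity lies mostly along the long (y) axis.
    return "portrait" if abs(ay) >= abs(ax) else "landscape"
```

This two-way classification is only the horizontal/vertical screen switch mentioned above; the posture information used to determine the microphone coordinates would need the full three-axis reading.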
- the first electronic device 100 may measure the distance through infrared or laser. In some embodiments, when shooting a scene, the first electronic device 100 can use the distance sensor 180F to measure the distance to achieve fast focusing.
- Proximity light sensor 180G may include, for example, light emitting diodes (LEDs) and light detectors, such as photodiodes.
- the light emitting diodes may be infrared light emitting diodes.
- the first electronic device 100 emits infrared light to the outside through the light emitting diode.
- the first electronic device 100 detects infrared reflected light from nearby objects using a photodiode. When sufficient reflected light is detected, it may be determined that there is an object near the first electronic device 100 . When insufficient reflected light is detected, the first electronic device 100 may determine that there is no object near the first electronic device 100 .
- the first electronic device 100 can use the proximity light sensor 180G to detect that the user holds the first electronic device 100 close to the ear to talk, so as to automatically turn off the screen to save power.
- The ambient light sensor 180L is used to sense ambient light brightness.
- The first electronic device 100 can adaptively adjust the brightness of the display screen 194 according to the sensed ambient light brightness.
- The fingerprint sensor 180H is used to collect fingerprints.
- The first electronic device 100 can use the collected fingerprint characteristics to implement fingerprint unlocking, application lock access, fingerprint-triggered photographing, fingerprint-based call answering, and the like.
- The temperature sensor 180J is used to detect temperature.
- The first electronic device 100 uses the temperature detected by the temperature sensor 180J to execute a temperature processing strategy.
- The touch sensor 180K is also called a “touch panel”.
- The touch sensor 180K may be disposed on the display screen 194; together, the touch sensor 180K and the display screen 194 form a touchscreen, also called a “touch-control screen”.
- The touch sensor 180K is used to detect a touch operation on or near it.
- The touch sensor can pass the detected touch operation to the application processor to determine the type of touch event.
- Visual output related to touch operations may be provided through the display screen 194.
- The bone conduction sensor 180M can acquire vibration signals.
- The keys 190 include a power key, a volume key, and the like. The keys 190 may be mechanical keys or touch keys.
- The motor 191 can generate vibration prompts.
- The indicator 192 may be an indicator light, which can indicate the charging state and changes in battery level, and can also indicate messages, missed calls, notifications, and the like.
- The SIM card interface 195 is used to connect a SIM card.
- The SIM card can be inserted into or pulled out of the SIM card interface 195 to come into contact with or separate from the first electronic device 100.
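The proximity-detection behaviour described for the proximity light sensor 180G can be sketched as a simple threshold decision. This is only an illustrative sketch: the description says "sufficient" reflected light without giving a value, so the normalised reading scale, the `DEFAULT_THRESHOLD` value, and all function names below are assumptions, not part of the source.

```python
from dataclasses import dataclass

# Hypothetical threshold: the description only speaks of "sufficient" reflected light.
DEFAULT_THRESHOLD = 0.5


@dataclass
class ProximityReading:
    """One photodiode sample, normalised to 0.0..1.0 (assumed scale)."""
    reflected_level: float


def object_is_near(reading: ProximityReading, threshold: float = DEFAULT_THRESHOLD) -> bool:
    """Treat the reflected infrared light as 'sufficient' at or above the threshold."""
    return reading.reflected_level >= threshold


def screen_should_turn_off(in_call: bool, reading: ProximityReading,
                           threshold: float = DEFAULT_THRESHOLD) -> bool:
    """Turn the screen off only when a call is active and an object (the ear) is near."""
    return in_call and object_is_near(reading, threshold)
```

A real device would likely add hysteresis or debouncing so that the screen does not flicker when the reading hovers around the threshold; that is omitted here for brevity.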
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Signal Processing (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- General Physics & Mathematics (AREA)
- Oral & Maxillofacial Surgery (AREA)
- Human Computer Interaction (AREA)
- Multimedia (AREA)
- Theoretical Computer Science (AREA)
- Otolaryngology (AREA)
- Circuit For Audible Band Transducer (AREA)
Claims (15)
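- A sound collecting method, characterized in that the method is applied to a first electronic device having a first microphone, a left-ear wireless earphone having a second microphone, and a right-ear wireless earphone having a third microphone, the first electronic device being connected to the left-ear wireless earphone and the right-ear wireless earphone through wireless communication; the method comprises: the first electronic device captures a face image; the first electronic device determines the relative positions of the first microphone, the second microphone, and the third microphone based on the face image and posture information of the first electronic device; the first electronic device obtains a first audio signal of the first microphone, a second audio signal of the second microphone, and a third audio signal of the third microphone; the first electronic device performs noise reduction processing on the first audio signal, the second audio signal, and the third audio signal based on the relative positions.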
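- The method according to claim 1, characterized in that, before the first electronic device performs noise reduction processing on the first audio signal, the second audio signal, and the third audio signal based on the relative positions, the method further comprises: the first electronic device emits an alignment sound, the alignment sound being obtained by digital-to-analog conversion of an alignment signal; the first electronic device performs delay correlation detection on a first alignment signal part in the first audio signal, a second alignment signal part in the second audio signal, and a third alignment signal part in the third audio signal, to determine the delay lengths between the first audio signal, the second audio signal, and the third audio signal; the first electronic device performs delay alignment on the first audio signal, the second audio signal, and the third audio signal based on the delay lengths.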
- The method according to claim 2, characterized in that the alignment signal is an audio signal with a frequency higher than 20000 Hz.
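- The method according to any one of claims 1-3, characterized in that the first electronic device determining the relative positions of the first microphone, the second microphone, and the third microphone according to the face image and the posture information of the first electronic device specifically comprises: the first electronic device determines a first conversion relationship between a standard human head coordinate system and a first electronic device coordinate system based on the correspondence between the coordinates of first face key points in the standard human head coordinate system and their coordinates in a face image coordinate system; the standard human head coordinate system is determined according to a standard human head model, and the first electronic device stores the coordinates of each key point of the standard human head model in the standard human head coordinate system; the first electronic device determines, based on the first conversion relationship and the coordinates of the left ear and the right ear of the standard human head model in the standard human head coordinate system, the coordinates of the left ear and the right ear of the standard human head model in the first electronic device coordinate system, these coordinates being respectively the coordinates of the second microphone and the third microphone in the first electronic device coordinate system; the first electronic device determines a second conversion relationship between the first electronic device coordinate system and a world coordinate system according to the posture information of the first electronic device; the first electronic device determines the relative positions based on the second conversion relationship and the coordinates of the first microphone, the second microphone, and the third microphone in the first electronic device coordinate system, the relative positions comprising the coordinates of the first microphone, the second microphone, and the third microphone in the world coordinate system.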
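- The method according to any one of claims 1-4, characterized in that the first electronic device performing noise reduction processing on the first audio signal, the second audio signal, and the third audio signal based on the relative positions specifically comprises: the first electronic device performs voice activity detection on the first audio signal, the second audio signal, and the third audio signal based on the relative positions, the voice activity detection being used to determine the frequency points of a target sound signal and the frequency points of an environmental noise signal in the first audio signal, the second audio signal, and the third audio signal; the target sound signal is the sound signal of a sound source located in the near-field region of the microphone array formed by the first microphone, the second microphone, and the third microphone; the first electronic device updates a noise spatial characteristic of the environmental noise based on the frequency points of the target sound signal and the frequency points of the environmental noise signal, the noise spatial characteristic being used to indicate the spatial distribution of the environmental noise, which comprises the direction and energy of the environmental noise; the first electronic device determines a target steering vector of the first audio signal, the second audio signal, and the third audio signal based on the relative positions, the target steering vector being used to indicate the direction of the target sound signal; the first electronic device determines a spatial filter based on the noise spatial characteristic and the target steering vector, and uses the spatial filter to perform spatial filtering on the first audio signal, the second audio signal, and the third audio signal.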
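- The method according to any one of claims 1-5, characterized in that the method further comprises: the left-ear wireless earphone performs wearing detection, the wearing detection being used to determine whether a wireless earphone is in an in-ear state; when the left-ear wireless earphone is in the in-ear state, the left-ear wireless earphone obtains the second audio signal using the second microphone; the right-ear wireless earphone performs the wearing detection; when the right-ear wireless earphone is in the in-ear state, the right-ear wireless earphone obtains the third audio signal using the third microphone.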
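- The method according to any one of claims 1-6, characterized in that the method further comprises: the first electronic device mixes a first video captured in a first time period with a fourth audio signal, the fourth audio signal being the audio signal obtained after the noise reduction processing of the first audio signal, the second audio signal, and the third audio signal; the first audio signal, the second audio signal, and the third audio signal are obtained in the first time period through the first microphone, the second microphone, and the third microphone, respectively.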
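- A sound collecting method, the method being applied to a first electronic device having a first microphone, the first electronic device being connected through wireless communication to a left-ear wireless earphone having a second microphone and a right-ear wireless earphone having a third microphone, characterized in that the method comprises: the first electronic device captures a face image; the first electronic device determines the relative positions of the first microphone, the second microphone, and the third microphone based on the face image and posture information of the first electronic device; the first electronic device obtains a first audio signal of the first microphone, a second audio signal of the second microphone, and a third audio signal of the third microphone; the first electronic device performs noise reduction processing on the first audio signal, the second audio signal, and the third audio signal based on the relative positions.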
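- The method according to claim 8, characterized in that, before the first electronic device performs noise reduction processing on the first audio signal, the second audio signal, and the third audio signal based on the relative positions, the method further comprises: the first electronic device emits an alignment sound, the alignment sound being obtained by digital-to-analog conversion of an alignment signal; the first electronic device performs delay correlation detection on a first alignment signal part in the first audio signal, a second alignment signal part in the second audio signal, and a third alignment signal part in the third audio signal, to determine the delay lengths between the first audio signal, the second audio signal, and the third audio signal; the first electronic device performs delay alignment on the first audio signal, the second audio signal, and the third audio signal based on the delay lengths.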
- The method according to claim 9, characterized in that the alignment signal is an audio signal with a frequency higher than 20000 Hz.
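- The method according to any one of claims 8-10, characterized in that the first electronic device determining the relative positions of the first microphone, the second microphone, and the third microphone according to the face image and the posture information of the first electronic device specifically comprises: the first electronic device determines a first conversion relationship between a standard human head coordinate system and a first electronic device coordinate system based on the correspondence between the coordinates of first face key points in the standard human head coordinate system and their coordinates in a face image coordinate system; the standard human head coordinate system is determined according to a standard human head model, and the first electronic device stores the coordinates of each key point of the standard human head model in the standard human head coordinate system; the first electronic device determines, based on the first conversion relationship and the coordinates of the left ear and the right ear of the standard human head model in the standard human head coordinate system, the coordinates of the left ear and the right ear of the standard human head model in the first electronic device coordinate system, these coordinates being respectively the coordinates of the second microphone and the third microphone in the first electronic device coordinate system; the first electronic device determines a second conversion relationship between the first electronic device coordinate system and a world coordinate system according to the posture information of the first electronic device; the first electronic device determines the relative positions based on the second conversion relationship and the coordinates of the first microphone, the second microphone, and the third microphone in the first electronic device coordinate system, the relative positions comprising the coordinates of the first microphone, the second microphone, and the third microphone in the world coordinate system.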
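- The method according to any one of claims 8-11, characterized in that the first electronic device performing noise reduction processing on the first audio signal, the second audio signal, and the third audio signal based on the relative positions specifically comprises: the first electronic device performs voice activity detection on the first audio signal, the second audio signal, and the third audio signal based on the relative positions, the voice activity detection being used to determine the frequency points of a target sound signal and the frequency points of an environmental noise signal in the first audio signal, the second audio signal, and the third audio signal; the target sound signal is the sound signal of a sound source located in the near-field region of the microphone array formed by the first microphone, the second microphone, and the third microphone; the first electronic device updates a noise spatial characteristic of the environmental noise based on the frequency points of the target sound signal and the frequency points of the environmental noise signal, the noise spatial characteristic being used to indicate the spatial distribution of the environmental noise, which comprises the direction and energy of the environmental noise; the first electronic device determines a target steering vector of the first audio signal, the second audio signal, and the third audio signal based on the relative positions, the target steering vector being used to indicate the direction of the target sound signal; the first electronic device determines a spatial filter based on the noise spatial characteristic and the target steering vector, and uses the spatial filter to perform spatial filtering on the first audio signal, the second audio signal, and the third audio signal.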
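- The method according to any one of claims 8-12, characterized in that the method further comprises: the first electronic device mixes a first video captured in a first time period with a fourth audio signal, the fourth audio signal being the audio signal obtained after the noise reduction processing of the first audio signal, the second audio signal, and the third audio signal; the first audio signal, the second audio signal, and the third audio signal are audio signals collected in the first time period.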
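- An electronic device, characterized in that the electronic device comprises a communication apparatus, a camera, a microphone, a memory, and a processor, wherein: the communication apparatus is used to establish a communication connection with wireless earphones; the camera is used to capture images; the microphone is used for sound collection; the memory is used to store a standard human head coordinate system and is also used to store a computer program; the processor is used to invoke the computer program so that the electronic device performs the method according to any one of claims 8-13.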
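- A computer storage medium, characterized by comprising: computer instructions; when the computer instructions run on an electronic device, the electronic device is caused to perform the method according to any one of claims 1-13.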
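The delay alignment step recited in claims 2-3 and 9-10 can be illustrated with a small cross-correlation sketch. It is not the patented implementation: the 48 kHz sample rate, the 21 kHz windowed burst used as a stand-in alignment signal, the simulated recordings, and every function name are assumptions chosen for illustration. The claims only state that the alignment signal is above 20000 Hz and that delays are found by correlation detection.

```python
import math

SAMPLE_RATE = 48_000  # Hz (assumed; not specified in the claims)
TONE_HZ = 21_000      # above 20000 Hz, as required of the alignment signal


def make_alignment_burst(n_samples=256):
    """Windowed ultrasonic sine burst standing in for the alignment signal."""
    return [
        math.sin(2 * math.pi * TONE_HZ * n / SAMPLE_RATE)
        * math.sin(math.pi * n / (n_samples - 1))  # half-sine window
        for n in range(n_samples)
    ]


def estimate_lag(reference, other, max_lag):
    """Return the lag in samples at which `other` best correlates with `reference`."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(max_lag + 1):
        score = sum(
            reference[n] * other[n + lag]
            for n in range(len(reference))
            if n + lag < len(other)
        )
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def align(signal, lag):
    """Drop the first `lag` samples so the signal lines up with the reference."""
    return signal[lag:]


def record(burst, offset, total):
    """Simulate a noise-free recording in which the burst arrives at `offset`."""
    return [0.0] * offset + burst + [0.0] * (total - offset - len(burst))


burst = make_alignment_burst()
first = record(burst, 10, 400)    # first microphone (phone)
second = record(burst, 47, 400)   # second microphone (left earphone), 37 samples late
third = record(burst, 62, 400)    # third microphone (right earphone), 52 samples late

lag_second = estimate_lag(first, second, max_lag=100)
lag_third = estimate_lag(first, third, max_lag=100)
print(lag_second, lag_third)  # 37 52
```

In this sketch the first microphone's recording is used as the reference, and the other two signals are trimmed by their estimated lags with `align`. Real recordings contain noise and clock drift, so a practical version would likely band-pass filter around the alignment frequency and normalise the correlation before picking the peak.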
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| EP21913860.9A EP4258685A4 (en) | 2020-12-29 | 2021-12-13 | SOUND COLLECTION METHOD, ELECTRONIC DEVICE AND SYSTEM |
| US18/259,528 US12413885B2 (en) | 2020-12-29 | 2021-12-13 | Sound collecting method, electronic device, and system |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202011593358.1 | 2020-12-29 | ||
| CN202011593358.1A CN114697812B (zh) | 2020-12-29 | 2020-12-29 | Sound collecting method, electronic device, and system |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2022143119A1 (zh) | 2022-07-07 |
Family
ID=82132180
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2021/137406 Ceased WO2022143119A1 (zh) | Sound collecting method, electronic device, and system | 2020-12-29 | 2021-12-13 |
Country Status (4)
| Country | Link |
|---|---|
| US (1) | US12413885B2 (zh) |
| EP (1) | EP4258685A4 (zh) |
| CN (1) | CN114697812B (zh) |
| WO (1) | WO2022143119A1 (zh) |
Families Citing this family (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US12170097B2 (en) * | 2022-08-17 | 2024-12-17 | Caterpillar Inc. | Detection of audio communication signals present in a high noise environment |
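| CN115589566A (zh) * | 2022-09-26 | 2023-01-10 | 北京小米移动软件有限公司 | Audio focusing method and apparatus, storage medium, and electronic device |
| EP4594855A1 (en) * | 2022-09-30 | 2025-08-06 | Sonos, Inc. | Generative audio playback via wearable playback devices |
| CN116668892B (zh) * | 2022-11-14 | 2024-04-12 | 荣耀终端有限公司 | Audio signal processing method, electronic device, and readable storage medium |
| CN116320851A (zh) * | 2023-01-17 | 2023-06-23 | 泰凌微电子(上海)股份有限公司 | Microphone array noise reduction method, apparatus, system, electronic device, and storage medium |
| CN118042329B (zh) * | 2024-04-11 | 2024-07-02 | 深圳波洛斯科技有限公司 | Multi-microphone array noise reduction method and system based on a conference scenario |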
Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2015184893A1 (zh) * | 2014-11-21 | 2015-12-10 | 中兴通讯股份有限公司 | Method and apparatus for noise reduction of call voice on a mobile terminal |
| CN106960670A (zh) * | 2017-03-27 | 2017-07-18 | 联想(北京)有限公司 | Recording method and electronic device |
| CN108196769A (zh) * | 2017-12-29 | 2018-06-22 | 上海爱优威软件开发有限公司 | Voice message sending method and terminal device |
Family Cites Families (18)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US8855341B2 (en) * | 2010-10-25 | 2014-10-07 | Qualcomm Incorporated | Systems, methods, apparatus, and computer-readable media for head tracking based on recorded sound signals |
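| JP6069829B2 (ja) * | 2011-12-08 | 2017-02-01 | ソニー株式会社 | Ear-hole-mounted sound collecting device, signal processing device, and sound collecting method |
| CN105848062B (zh) * | 2015-01-12 | 2018-01-05 | 芋头科技(杭州)有限公司 | Multi-channel digital microphone |
| KR102262218B1 (ko) * | 2015-01-28 | 2021-06-09 | 삼성전자주식회사 | Ear jack recognition method and electronic device supporting the same |
| AU2016218989B2 (en) * | 2015-02-13 | 2020-09-10 | Noopl, Inc. | System and method for improving hearing |
| TWI823334B (zh) * | 2016-10-24 | 2023-11-21 | 美商艾孚諾亞公司 | Automatic noise cancellation using multiple microphones |
| KR102535726B1 (ko) * | 2016-11-30 | 2023-05-24 | 삼성전자주식회사 | Method for detecting incorrect wearing of earphones, and electronic device and storage medium therefor |
| KR20180108155A (ko) * | 2017-03-24 | 2018-10-04 | 삼성전자주식회사 | Method and electronic device for outputting a signal with adjusted wind noise |
| CN107452375A (zh) * | 2017-07-17 | 2017-12-08 | 湖南海翼电子商务股份有限公司 | Bluetooth earphone |
| CN108305637B (zh) * | 2018-01-23 | 2021-04-06 | Oppo广东移动通信有限公司 | Earphone voice processing method, terminal device, and storage medium |
| CN110085247B (zh) * | 2019-05-06 | 2021-04-20 | 上海互问信息科技有限公司 | Dual-microphone noise reduction method for complex noise environments |
| CN110191388A (zh) | 2019-05-31 | 2019-08-30 | 深圳市荣盛智能装备有限公司 | Bone conduction earphone noise reduction method and apparatus, electronic device, and storage medium |
| CN110121129B (zh) | 2019-06-20 | 2021-04-20 | 歌尔股份有限公司 | Microphone array noise reduction method and apparatus for an earphone, earphone, and TWS earphone |
| CN110933555A (zh) * | 2019-12-19 | 2020-03-27 | 歌尔股份有限公司 | TWS noise-reduction earphone and noise reduction method and apparatus thereof |
| CN113284504B (zh) * | 2020-02-20 | 2024-11-08 | 北京三星通信技术研究有限公司 | Posture detection method and apparatus, electronic device, and computer-readable storage medium |
| CN111402913B (zh) * | 2020-02-24 | 2023-09-12 | 北京声智科技有限公司 | Noise reduction method, apparatus, device, and storage medium |
| CN210868121U (zh) | 2020-02-25 | 2020-06-26 | 广州疯酷科技有限公司 | Dual-microphone noise-reduction TWS Bluetooth earphone |
| CN111970626B (zh) * | 2020-08-28 | 2022-03-22 | Oppo广东移动通信有限公司 | Recording method and apparatus, recording system, and storage medium |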
2020
- 2020-12-29: CN application CN202011593358.1A, patent CN114697812B (zh), status Active
2021
- 2021-12-13: US application US18/259,528, patent US12413885B2 (en), status Active
- 2021-12-13: WO application PCT/CN2021/137406, publication WO2022143119A1 (zh), status Ceased
- 2021-12-13: EP application EP21913860.9A, publication EP4258685A4 (en), status Pending
Non-Patent Citations (1)
| Title |
|---|
| See also references of EP4258685A4 |
Cited By (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
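| WO2024016998A1 (zh) * | 2022-07-22 | 2024-01-25 | 比亚迪股份有限公司 | Pairing method based on two-way communication, and audio-video entertainment system |
| CN115361636A (zh) * | 2022-08-15 | 2022-11-18 | Oppo广东移动通信有限公司 | Sound signal adjustment method and apparatus, terminal device, and storage medium |
| CN116962935A (zh) * | 2023-09-20 | 2023-10-27 | 深圳市齐奥通信技术有限公司 | Earphone noise reduction method and system based on data analysis |
| CN116962935B (zh) * | 2023-09-20 | 2024-01-30 | 深圳市齐奥通信技术有限公司 | Earphone noise reduction method and system based on data analysis |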
Also Published As
| Publication number | Publication date |
|---|---|
| EP4258685A1 (en) | 2023-10-11 |
| CN114697812A (zh) | 2022-07-01 |
| US20240064449A1 (en) | 2024-02-22 |
| EP4258685A4 (en) | 2024-05-01 |
| US12413885B2 (en) | 2025-09-09 |
| CN114697812B (zh) | 2023-06-20 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 21913860; Country of ref document: EP; Kind code of ref document: A1 |
| | WWE | WIPO information: entry into national phase | Ref document number: 18259528; Country of ref document: US |
| | ENP | Entry into the national phase | Ref document number: 2021913860; Country of ref document: EP; Effective date: 20230704 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | WWG | WIPO information: grant in national office | Ref document number: 18259528; Country of ref document: US |


