WO2022042168A1 - Audio processing method and electronic device - Google Patents
Audio processing method and electronic device
- Publication number
- WO2022042168A1 (application PCT/CN2021/108458)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- audio
- sound pickup
- pickup range
- video
- image
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N23/00—Cameras or camera modules comprising electronic image sensors; Control thereof
- H04N23/60—Control of cameras or camera modules
- H04N23/62—Control of parameters via user interfaces
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; ELECTRIC HEARING AIDS; PUBLIC ADDRESS SYSTEMS
- H04R3/00—Circuits for transducers
- H04R3/005—Circuits for transducers for combining the signals of two or more microphones
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N7/00—Television systems
- H04N7/14—Systems for two-way working
- H04N7/141—Systems for two-way working between two video terminals, e.g. videophone
- H04N7/147—Communication arrangements, e.g. identifying the communication as a video-communication, intermediate storage of the signals
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/47—End-user applications
- H04N21/478—Supplemental services, e.g. displaying phone caller identification, shopping application
- H04N21/4788—Supplemental services, e.g. displaying phone caller identification, shopping application communicating with other users, e.g. chatting
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/80—Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
- H04N21/81—Monomedia components thereof
- H04N21/8106—Monomedia components thereof involving special audio data, e.g. different tracks for different languages
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N23/00—Cameras or camera modules comprising electronic image sensors; Control thereof
- H04N23/60—Control of cameras or camera modules
- H04N23/61—Control of cameras or camera modules based on recognised objects
- H04N23/611—Control of cameras or camera modules based on recognised objects where the recognised objects include parts of the human body
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N23/00—Cameras or camera modules comprising electronic image sensors; Control thereof
- H04N23/60—Control of cameras or camera modules
- H04N23/63—Control of cameras or camera modules by using electronic viewfinders
- H04N23/631—Graphical user interfaces [GUI] specially adapted for controlling image capture or setting capture parameters
- H04N23/632—Graphical user interfaces [GUI] specially adapted for controlling image capture or setting capture parameters for displaying or modifying preview images prior to image capturing, e.g. variety of image resolutions or capturing parameters
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N23/00—Cameras or camera modules comprising electronic image sensors; Control thereof
- H04N23/60—Control of cameras or camera modules
- H04N23/63—Control of cameras or camera modules by using electronic viewfinders
- H04N23/633—Control of cameras or camera modules by using electronic viewfinders for displaying additional information relating to control or operation of the camera
- H04N23/635—Region indicators; Field of view indicators
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N23/00—Cameras or camera modules comprising electronic image sensors; Control thereof
- H04N23/60—Control of cameras or camera modules
- H04N23/67—Focus control based on electronic image sensor signals
- H04N23/675—Focus control based on electronic image sensor signals comprising setting of focusing regions
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N23/00—Cameras or camera modules comprising electronic image sensors; Control thereof
- H04N23/90—Arrangement of cameras or camera modules, e.g. multiple cameras in TV studios or sports stadiums
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N5/00—Details of television systems
- H04N5/76—Television signal recording
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N7/00—Television systems
- H04N7/14—Systems for two-way working
- H04N7/141—Systems for two-way working between two video terminals, e.g. videophone
- H04N7/142—Constructional details of the terminal equipment, e.g. arrangements of the camera and the display
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; ELECTRIC HEARING AIDS; PUBLIC ADDRESS SYSTEMS
- H04R1/00—Details of transducers, loudspeakers or microphones
- H04R1/02—Casings; Cabinets ; Supports therefor; Mountings therein
- H04R1/028—Casings; Cabinets ; Supports therefor; Mountings therein associated with devices performing functions other than acoustics, e.g. electric candles
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; ELECTRIC HEARING AIDS; PUBLIC ADDRESS SYSTEMS
- H04R1/00—Details of transducers, loudspeakers or microphones
- H04R1/20—Arrangements for obtaining desired frequency or directional characteristics
- H04R1/32—Arrangements for obtaining desired frequency or directional characteristics for obtaining desired directional characteristic only
- H04R1/40—Arrangements for obtaining desired frequency or directional characteristics for obtaining desired directional characteristic only by combining a number of identical transducers
- H04R1/406—Arrangements for obtaining desired frequency or directional characteristics for obtaining desired directional characteristic only by combining a number of identical transducers microphones
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04S—STEREOPHONIC SYSTEMS
- H04S7/00—Indicating arrangements; Control arrangements, e.g. balance control
- H04S7/30—Control circuits for electronic adaptation of the sound field
- H04S7/302—Electronic adaptation of stereophonic sound system to listener position or orientation
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04S—STEREOPHONIC SYSTEMS
- H04S7/00—Indicating arrangements; Control arrangements, e.g. balance control
- H04S7/40—Visual indication of stereophonic sound image
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; ELECTRIC HEARING AIDS; PUBLIC ADDRESS SYSTEMS
- H04R2201/00—Details of transducers, loudspeakers or microphones covered by H04R1/00 but not provided for in any of its subgroups
- H04R2201/02—Details casings, cabinets or mounting therein for transducers covered by H04R1/02 but not provided for in any of its subgroups
- H04R2201/025—Transducer mountings or cabinet supports enabling variable orientation of transducer of cabinet
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; ELECTRIC HEARING AIDS; PUBLIC ADDRESS SYSTEMS
- H04R2201/00—Details of transducers, loudspeakers or microphones covered by H04R1/00 but not provided for in any of its subgroups
- H04R2201/40—Details of arrangements for obtaining desired directional characteristic by combining a number of identical transducers covered by H04R1/40 but not provided for in any of its subgroups
- H04R2201/405—Non-uniform arrays of transducers or a plurality of uniform arrays with different transducer spacing
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; ELECTRIC HEARING AIDS; PUBLIC ADDRESS SYSTEMS
- H04R2430/00—Signal processing covered by H04R, not provided for in its groups
- H04R2430/01—Aspects of volume control, not necessarily automatic, in sound systems
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; ELECTRIC HEARING AIDS; PUBLIC ADDRESS SYSTEMS
- H04R2499/00—Aspects covered by H04R or H04S not otherwise provided for in their subgroups
- H04R2499/10—General applications
- H04R2499/11—Transducers incorporated or for use in hand-held devices, e.g. mobile phones, PDA's, camera's
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04S—STEREOPHONIC SYSTEMS
- H04S2400/00—Details of stereophonic systems covered by H04S but not provided for in its groups
- H04S2400/15—Aspects of sound capture and related signal processing for recording or reproduction
Definitions
- the present application relates to the field of electronic technology, and in particular, to an audio processing method and an electronic device.
- a voice enhancement method is also proposed.
- the audio file collected by the electronic device is processed by an audio algorithm to remove noise.
- the processing capability of the audio algorithm is more demanding.
- the complex audio processing process will also increase the requirements for the hardware performance of electronic equipment.
- the audio processing method and electronic device provided by the present application can achieve directional speech enhancement by determining the position of the face or mouth of the person making the sound in the video picture, and determining the range that needs to be picked up according to the position of the person's face or mouth. It not only simplifies the audio processing algorithm, but also improves the audio quality.
- the present application provides an audio processing method, the method is applied to an electronic device, and the method may include: detecting a first operation of opening a camera application. In response to the first operation, a shooting preview interface is displayed. A second operation to start recording is detected. In response to the second operation, the video picture and the first audio are collected, and a shooting interface is displayed, and the shooting interface includes a preview interface of the video picture. Identify a target image in the video picture, where the target image is a first face image and/or a first human mouth image. Wherein, the first face image is the face image of the vocal object in the video image, and the first human mouth image is the human mouth image of the vocal object in the video image.
- the first sound pickup range corresponding to the sounding object is determined.
- the second audio corresponding to the video picture is obtained.
- the audio volume within the first sound pickup range in the second audio is greater than the audio volume outside the first sound pickup range.
- the method in the embodiment of the present application may be applied to a scenario in which a user instruction is received to directly start a camera application. It can also be applied to the scene where the user opens other third-party applications (such as short video applications, live broadcast applications, video calling applications, etc.) and invokes the startup of the camera.
- the first operation or the second operation includes, for example, a touch operation, a key operation, an air gesture operation, a voice operation, and the like.
- the method further includes: detecting a sixth operation of starting the voice enhancement mode.
- the speech enhancement mode is activated.
- the user is firstly asked whether to enable the voice enhancement mode. After the user confirms that the voice enhanced mode is turned on, the voice enhanced mode is activated. Or, automatically activate the voice enhancement mode after detecting the switch to the video recording function. In still other embodiments, after switching to the video recording function is detected, the video recording preview interface is displayed first, and after detecting the operation instructed by the user to shoot, the voice enhanced mode is activated according to the user's instruction, or the voice enhanced mode is automatically activated.
- after starting the voice enhancement mode, the electronic device needs to process the collected first audio, identify the audio of the sounding object, and enhance that part of the audio to obtain a better recording effect.
- the first audio is, for example, the collected initial audio signal.
- the second audio is audio obtained after voice enhancement processing.
- the first face image or the first human mouth image is identified through a face image recognition algorithm. For example, in the process of recording a video picture, it is determined by a face image recognition algorithm whether the collected video picture contains a face image. If a face image is included, the included face image is identified, and whether it is uttering is determined according to changes in the facial feature data of the face image, such as facial feature data, facial contour data, etc. within a preset time period. Wherein, the criterion for judging that the face image is uttering sound includes judging that the face image is currently uttering sound. Or, it is determined that the human face image is vocalizing again within a preset time period after it is determined that the human face image is vocalizing for the first time.
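The vocalization check described above (a change in facial feature data within a preset time period) could be sketched as follows. This is only an illustration under stated assumptions: the function name `is_vocalizing`, the use of a single "mouth openness" measurement per frame, and the threshold value are all hypothetical and do not come from the application.

```python
from typing import Sequence


def is_vocalizing(mouth_openness: Sequence[float], threshold: float = 0.05) -> bool:
    """Guess whether a face is speaking from per-frame mouth-openness values.

    `mouth_openness` holds one value per frame for the preset time window
    (hypothetical feature). The face is treated as vocalizing when the values
    vary by more than `threshold` within the window (assumed criterion).
    """
    if len(mouth_openness) < 2:
        # A single frame cannot show any change over time.
        return False
    return max(mouth_openness) - min(mouth_openness) > threshold
```

A face whose mouth measurement stays constant over the window is not treated as a sounding object, while one whose measurement fluctuates is.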
- the human vocal organ is the human mouth.
- the vocal human mouth data can be obtained, the data of the first human mouth image can be preferentially determined, and then the first sound pickup range can be determined based on the data of the first human mouth image.
- the image corresponding to a person who is not speaking is not the target image. That is, the target image is the image corresponding to the recognized voice-producing face and/or the voice-producing mouth.
- the first sound pickup range that needs to be enhanced for sound pickup is determined. Further, based on the collected initial audio signal and the first sound pickup range, the second audio is obtained. In the second audio, the audio volume within the first sound pickup range is greater than the audio volume outside the first sound pickup range. That is, the volume of the person speaking is boosted, thereby improving the audio recording.
- determining the first sound pickup range corresponding to the sounding object according to the target image includes: obtaining the first feature value according to the target image.
- the first feature value includes one or more items of front and rear attribute parameters, area ratio, and position information.
- the front and rear attribute parameters are used to indicate whether the video image is a video image captured by the front camera or a video image captured by the rear camera.
- the area ratio is used to represent the ratio of the area of the target image to the area of the video screen.
- the position information is used to indicate the position of the target image in the video picture. Then, according to the first feature value, the first sound pickup range corresponding to the sounding object is determined.
- the first feature value is used to describe the relative positional relationship between the face of the real person corresponding to the first face image and the electronic device, or the relative positional relationship between the mouth of the real person corresponding to the first human mouth image and the electronic device. Therefore, the electronic device can determine the first sound pickup range according to the first feature value. For example, if the real person corresponding to the first face image is located directly in front of the electronic device, that is, the first face image is located in the center of the captured video picture, the first sound pickup range is the sound pickup range directly in front of the electronic device. Subsequently, after the electronic device obtains the initial audio signal including the audio signals in various directions, the electronic device may obtain the audio corresponding to the first face image based on the initial audio signal and the first sound pickup range.
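One possible way to hold the three items of the first feature value (front and rear attribute parameter, area ratio, position information) is sketched below. The type names `Box` and `FirstFeatureValue`, the function `extract_feature`, and the pixel-based coordinates are assumptions made for illustration; the application does not prescribe a data layout.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box of the target image, in pixels (hypothetical)."""
    left: float
    top: float
    width: float
    height: float


@dataclass(frozen=True)
class FirstFeatureValue:
    is_front_camera: bool        # front and rear attribute parameter
    area_ratio: float            # target image area / video picture area
    offset: Tuple[float, float]  # target centre minus video picture centre


def extract_feature(target: Box, frame_w: float, frame_h: float,
                    is_front_camera: bool) -> FirstFeatureValue:
    """Build the feature value from the target box and the video picture size."""
    area_ratio = (target.width * target.height) / (frame_w * frame_h)
    cx = target.left + target.width / 2
    cy = target.top + target.height / 2
    return FirstFeatureValue(is_front_camera, area_ratio,
                             (cx - frame_w / 2, cy - frame_h / 2))
```

Here the centre of the video picture is used as the first reference point; a focus point could be substituted without changing the structure.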
- the first feature value may change during the video recording process, and the first sound pickup range then changes accordingly. For the recorded video, the audio recorded by the electronic device therefore includes at least audio of a first duration and audio of a second duration.
- the audio of the first duration is the audio corresponding to the first sound pickup range.
- the audio of the second duration is the audio corresponding to the changed sound pickup range. That is to say, the electronic device can dynamically determine the sound pickup range based on the change of the voice-emitting face or the voice-emitting mouth in the video picture, and then record the audio according to the sound pickup range.
- the audio of the formed video picture may include multiple audios of different durations or the same duration recorded based on the changed sound pickup range according to the time sequence.
- the electronic device can always focus on improving the audio recording quality of the part that needs to be enhanced according to the change of the sound pickup range, thereby ensuring the audio recording effect.
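The idea that the recorded audio consists of consecutive pieces, each tied to the sound pickup range in force at the time, can be illustrated with a small grouping routine. The function name `segment_by_pickup_range`, the per-frame list of ranges, and the fixed frame duration are hypothetical simplifications.

```python
from typing import Hashable, List, Sequence, Tuple


def segment_by_pickup_range(frame_ranges: Sequence[Hashable],
                            frame_duration: float) -> List[Tuple[Hashable, float]]:
    """Group consecutive frames that share a pickup range into (range, duration) segments."""
    segments: List[Tuple[Hashable, float]] = []
    for pickup_range in frame_ranges:
        if segments and segments[-1][0] == pickup_range:
            # Same range as the previous frame: extend the current segment.
            segments[-1] = (pickup_range, segments[-1][1] + frame_duration)
        else:
            # The range changed: start a new segment.
            segments.append((pickup_range, frame_duration))
    return segments
```

With this grouping, a recording in which the speaker moves once yields two segments, matching the "first duration" and "second duration" audio described above.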
- when the user plays the video file, the user can be presented with a dynamically changing playing experience, such as a sound range that matches the change of the video content.
- determining the first sound pickup range corresponding to the sounding object according to the first feature value includes: when the video picture is a front video picture, determining that the first sound pickup range is the sound pickup range on the front camera side. When the video picture is a rear video picture, it is determined that the first sound pickup range is the sound pickup range on the rear camera side.
- the sound pickup range of the electronic device includes a 180-degree sound pickup range at the front and a 180-degree sound pickup range at the rear. Then, when it is determined that the video picture is the front video picture, the 180-degree sound pickup range at the front is used as the first sound pickup range. When it is determined that the video picture is the rear video picture, the 180-degree sound pickup range at the rear is used as the first sound pickup range. Further, during the video recording process, in response to the user's operation of switching the front and rear cameras, the first sound pickup range will also be switched from the front to the rear, so as to ensure that the first sound pickup range is the sound pickup range corresponding to the sounding object in the video picture.
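The front and rear selection can be written as a one-line lookup. The angle convention here (azimuth in degrees, with 0 along the front camera axis) and the function name `camera_side_pickup_range` are assumptions for illustration only.

```python
from typing import Tuple


def camera_side_pickup_range(is_front_camera: bool) -> Tuple[float, float]:
    """Return the 180-degree pickup range (start, end) in degrees.

    Assumed convention: azimuth 0 points along the front camera axis,
    so the front half is (-90, 90) and the rear half is (90, 270).
    """
    return (-90.0, 90.0) if is_front_camera else (90.0, 270.0)
```

Calling the function again after the user switches cameras gives the range for the other side of the device.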
- determining the first sound pickup range corresponding to the sounding object according to the first feature value includes: determining the first sound pickup range according to the area ratio and the sound pickup range of the first audio.
- the sound pickup range of the first audio is, for example, the sound pickup range of panoramic audio.
- the microphones are used to collect the initial audio signals in all directions, that is, the initial audio signals within the sound pickup range of the panoramic audio are obtained.
- the person concerned by the user is usually placed at the center of the video image, that is, the first face image or the first human mouth image is located at the center of the viewfinder frame.
- different first face images or first human mouth images correspond to different sound pickup ranges, and the area ratio can be used to describe the size of the first sound pickup range, such as its radius, diameter, or area.
- X is used to represent the first face image area or the first human mouth image area.
- Y is used to indicate the area of the video frame displayed by the viewfinder.
- N represents the sound pickup range corresponding to the viewing range.
- the area ratio is X/Y
- the first pickup range is N*X/Y. That is to say, the ratio of the first sound pickup range to the panoramic sound pickup range is proportional to the area ratio.
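The relationship N*X/Y can be checked with a short calculation. In this sketch the function name `first_pickup_range` is hypothetical, N is treated as a plain number (for example an angle in degrees), and clamping the ratio to the interval [0, 1] is an added safeguard that the application does not mention.

```python
def first_pickup_range(target_area: float, frame_area: float,
                       panoramic_range: float) -> float:
    """Compute N * X / Y, where X is the target image area, Y the video
    picture area, and N the pickup range corresponding to the viewing range."""
    if frame_area <= 0:
        raise ValueError("frame_area must be positive")
    # Clamp the area ratio so the result never exceeds N (assumption).
    ratio = min(max(target_area / frame_area, 0.0), 1.0)
    return panoramic_range * ratio
```

For example, a face that covers a quarter of the video picture with N = 180 gives a first sound pickup range of 45.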
- determining the first sound pickup range corresponding to the sounding object according to the first feature value includes: determining the position of the first sound pickup range in the sound pickup range of the first audio according to the position information.
- the sounding object is not located at the center of the video picture, and the position of the image corresponding to the sounding object (ie, the target image) in the video picture can be obtained according to the position information. It can be understood that there is a corresponding relationship between the position of the target image in the video picture and the position of the first sound pickup range in the panoramic sound pickup range.
- the position information includes a first offset of the center point of the target image relative to a first reference point, where the first reference point is the center point or the focus of the video image.
- determining the position of the first sound pickup range in the sound pickup range of the first audio includes: determining, according to the first offset, a second offset of the center point of the first sound pickup range relative to the center point of the sound pickup range of the first audio, where the second offset is proportional to the first offset. Then, according to the second offset, the position of the first sound pickup range in the sound pickup range of the first audio is determined.
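Because the second offset is proportional to the first offset, the mapping can be sketched as a simple scaling. The proportionality constant `k` and the function name `second_offset` are hypothetical; the application does not state how the constant is chosen.

```python
from typing import Tuple


def second_offset(first_offset: Tuple[float, float], k: float) -> Tuple[float, float]:
    """Scale the image-space offset (first offset) by the assumed constant k
    to obtain the offset of the pickup-range centre (second offset)."""
    dx, dy = first_offset
    return (k * dx, k * dy)
```

The direction of the offset is preserved; only its magnitude is scaled from image units to pickup-range units.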
- the offset amount includes, for example, the offset direction, and/or the offset angle, and/or the offset distance, and the like.
- the offset direction means that the center point of the first face image or the first human mouth image is offset to the left, right, top, bottom, upper left, upper right, lower left, or lower right of the first reference point.
- the offset angle is the angle of an offset toward the upper left, upper right, lower left, or lower right.
- the offset distance is the distance of an offset to the left, right, top, or bottom, or the distance of an offset at a given offset angle.
- the first reference point is used as the origin, the x-axis is parallel to the bottom edge of the mobile phone (or the bottom edge of the current viewfinder frame), and the direction perpendicular to the x-axis is the y-axis, so as to construct a coordinate system; the coordinate system is parallel to the display screen of the mobile phone.
- the constructed coordinate system is used to define the offset direction, offset angle and offset distance of the center point of the first face image or the first human mouth image relative to the first reference point. For example, if the position information of the target image is the lower left of the center point of the viewfinder frame, the first sound pickup range is in the panoramic sound pickup range, and the center point of the first sound pickup range is at the lower left of the center point of the panoramic sound pickup range.
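Given the coordinate system above, the offset direction, angle, and distance of a target center relative to the first reference point could be computed as in the following sketch. It assumes the y-axis points upward; the function name `describe_offset` and the direction labels are illustrative, not taken from the application.

```python
import math
from typing import Tuple


def describe_offset(target_center: Tuple[float, float],
                    reference_point: Tuple[float, float]) -> Tuple[str, float, float]:
    """Return (direction label, angle in degrees, distance) of the offset."""
    dx = target_center[0] - reference_point[0]
    dy = target_center[1] - reference_point[1]
    distance = math.hypot(dx, dy)
    # Angle measured counter-clockwise from the positive x-axis.
    angle = math.degrees(math.atan2(dy, dx))
    horizontal = "left" if dx < 0 else "right" if dx > 0 else ""
    vertical = "lower" if dy < 0 else "upper" if dy > 0 else ""
    direction = " ".join(part for part in (vertical, horizontal) if part) or "center"
    return direction, angle, distance
```

A target center at (-3, -4) relative to the origin is reported as "lower left" at a distance of 5, which corresponds to the lower-left example in the text.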
- the center point of the video picture is the center point of the viewfinder frame, or the center point of the video picture is the center point of the display screen.
- the center point of the viewfinder frame is used as the first reference point, that is, the center point of the viewfinder frame is used to represent the center point of the video picture.
- the first reference point may also be represented in other forms.
- the center point of the entire screen of the display screen of the mobile phone is used to represent the center point of the video image, that is, as the first reference point.
- obtaining the second audio corresponding to the video picture includes: enhancing the audio signals in the first audio that are within the first sound pickup range, and/or attenuating the audio signals in the first audio that are outside the first sound pickup range, to obtain the second audio.
- the first audio includes audio signals in various directions. After the first sound pickup range corresponding to the sound-emitting object is determined, the audio signals in the first sound pickup range are enhanced to improve the audio quality in the recorded video. Optionally, the audio signal outside the sound pickup range is further weakened to reduce the interference of external noise, and to highlight the sound emitted by the sounding object in the audio.
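Enhancing signals inside the first sound pickup range and weakening those outside it can be sketched as direction-dependent gains. This sketch assumes the first audio has already been separated into components tagged with an arrival azimuth (for example by beamforming, which is not shown); the function name `mix_with_pickup_range` and the gain values `boost` and `cut` are hypothetical.

```python
from typing import List, Sequence, Tuple


def mix_with_pickup_range(components: Sequence[Tuple[float, Sequence[float]]],
                          pickup_range: Tuple[float, float],
                          boost: float = 2.0, cut: float = 0.5) -> List[float]:
    """Mix (azimuth, samples) components into one signal.

    Components whose azimuth lies inside `pickup_range` are multiplied by
    `boost`; all others are multiplied by `cut`.
    """
    low, high = pickup_range
    length = max((len(samples) for _, samples in components), default=0)
    mixed = [0.0] * length
    for azimuth, samples in components:
        gain = boost if low <= azimuth <= high else cut
        for i, sample in enumerate(samples):
            mixed[i] += gain * sample
    return mixed
```

Setting `cut` to 1.0 corresponds to enhancement only, and setting `boost` to 1.0 corresponds to attenuation only, which mirrors the "and/or" wording above.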
- the electronic device includes one or more microphones, and the one or more microphones are used to collect the first audio.
- obtaining the second audio corresponding to the video picture according to the first sound pickup range and the first audio includes: when part or all of the first sound pickup range is included in the sound pickup range of a first microphone among the one or more microphones, performing at least one of the following operations to obtain the second audio: enhancing the audio signal within the first sound pickup range in the sound pickup range of the first microphone; attenuating the audio signal outside the first sound pickup range in the sound pickup range of the first microphone; attenuating the audio signals of the microphones other than the first microphone among the one or more microphones.
- the mobile phone is configured with a microphone 1 and a microphone 2.
- the first sound pickup range is within the sound pickup range of the microphone 1. After using the microphone 1 and the microphone 2 to obtain the initial audio signal, the mobile phone can enhance the audio signal within the first sound pickup range collected by the microphone 1, attenuate the audio signal outside the first sound pickup range collected by the microphone 1, and attenuate the audio signal collected by the microphone 2, to obtain the audio corresponding to the first face image or the first human mouth image.
- the mobile phone is configured with a microphone 1 and a microphone 2.
- the first sound pickup range includes a sound pickup range 1 within the sound pickup range of the microphone 1 and a sound pickup range 2 within the sound pickup range of the microphone 2. That is to say, the first sound pickup range is the union of the sound pickup range 1 and the sound pickup range 2. After the mobile phone uses the microphone 1 and the microphone 2 to obtain the initial audio signal, it can enhance the audio signals within the sound pickup range 1 of the microphone 1 and the sound pickup range 2 of the microphone 2, and attenuate the remaining audio signals in the initial audio signal, to obtain the audio corresponding to the first face image or the first human mouth image. It can be understood that the sound pickup range 1 and the sound pickup range 2 may overlap in whole or in part.
- the electronic device includes at least two microphones, and the at least two microphones are used to collect the first audio.
- obtaining the second audio corresponding to the video picture according to the first sound pickup range and the first audio includes: when the sound pickup range of a second microphone among the at least two microphones does not include the first sound pickup range, turning off the second microphone; the audio collected by the microphones other than the second microphone among the at least two microphones is the second audio.
- the mobile phone is configured with a microphone 1 and a microphone 2.
- the first sound pickup range is within the sound pickup range of the microphone 1 and outside the sound pickup range of the microphone 2.
- the mobile phone turns off the microphone 2 and takes the audio signal collected by the microphone 1 as the audio corresponding to the video picture; that is, the audio corresponding to the first face image or the first human mouth image is the audio collected by the microphone 1.
- when the second microphone is turned off, the method further includes: enhancing the audio signal within the first sound pickup range in the sound pickup ranges of the microphones other than the second microphone among the at least two microphones, and/or attenuating the audio signal outside the first sound pickup range in the sound pickup ranges of the microphones other than the second microphone among the at least two microphones.
- the mobile phone is configured with a microphone 1 and a microphone 2.
- the first sound pickup range is within the sound pickup range of the microphone 1 and outside the sound pickup range of the microphone 2.
- the mobile phone turns off the microphone 2, enhances the audio signal within the first sound pickup range in the audio signal collected by the microphone 1, and attenuates the audio signal outside the first sound pickup range, to obtain the audio corresponding to the first face image or the first human mouth image.
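The per-microphone decisions in the examples above can be summarized as a small planning step: a microphone whose sound pickup range overlaps the first sound pickup range is processed, and one whose range does not overlap is turned off. Representing each range as a (start, end) interval in degrees, and the names `overlaps` and `plan_microphones`, are assumptions for illustration.

```python
from typing import Dict, Tuple

Range = Tuple[float, float]


def overlaps(a: Range, b: Range) -> bool:
    """True when the two open intervals share more than a boundary point."""
    return a[0] < b[1] and b[0] < a[1]


def plan_microphones(mic_ranges: Dict[str, Range], first_range: Range) -> Dict[str, str]:
    """Choose an action for each microphone based on the first pickup range."""
    return {
        name: ("enhance_inside_attenuate_outside"
               if overlaps(mic_range, first_range) else "turn_off")
        for name, mic_range in mic_ranges.items()
    }
```

Replacing "turn_off" with an attenuation action would give the variant in which the other microphone stays on but its signal is weakened.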
- the number of first face images is one or more, and the number of first human mouth images is one or more. It can be understood that, if some people in the currently captured video picture are speaking but the mobile phone fails to recognize that they are speaking, the face images or mouth images of those unrecognized speakers are not classified as the above-mentioned first face image or first human mouth image.
- the first feature value needs to be determined based on multiple first face images or multiple first human mouth images. For example, in the process of determining the area ratio, the ratio of the area of the multiple first face images to the area of the video screen is used as the area ratio of the target image. For another example, in the process of determining the position information, the offset of the center point of the placeholder frame where the multiple first face images are located relative to the center point of the video image is used as the position information of the target image. Wherein, the placeholder frame where the multiple first face images are located is used to represent the smallest selection frame containing the multiple face images.
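For several first face images, the smallest selection frame and its center offset can be computed as sketched below. Boxes are written as (left, top, right, bottom) tuples in pixels; the names `enclosing_box` and `center_offset` are hypothetical.

```python
from typing import Sequence, Tuple

BoxLTRB = Tuple[float, float, float, float]  # left, top, right, bottom


def enclosing_box(boxes: Sequence[BoxLTRB]) -> BoxLTRB:
    """Smallest axis-aligned box that contains every input box."""
    lefts, tops, rights, bottoms = zip(*boxes)
    return (min(lefts), min(tops), max(rights), max(bottoms))


def center_offset(box: BoxLTRB, frame_size: Tuple[float, float]) -> Tuple[float, float]:
    """Offset of the box centre relative to the centre of the video picture."""
    cx = (box[0] + box[2]) / 2
    cy = (box[1] + box[3]) / 2
    return (cx - frame_size[0] / 2, cy - frame_size[1] / 2)
```

The resulting offset plays the role of the position information for the group of speakers as a whole.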
- the method further includes: detecting a third operation of stopping shooting. In response to the third operation, recording is stopped and a recorded video is generated; the recorded video includes a video picture and a second audio. Detect the fourth operation of playing the recorded video. In response to the fourth operation, a video playing interface is displayed, the video picture and the second audio are played.
- the electronic device determines the first sound pickup range according to the face image or human mouth image of the sounding object, and then records audio according to the first sound pickup range. Subsequently, the recorded audio needs to be saved, and the user can play the video picture and audio of the saved video.
- the scenario in which the video picture is recorded is a real-time communication scenario such as a live broadcast or a video call.
- for the method of recording audio while recording the video picture, reference may be made to the above method; however, when an operation by which the user instructs to stop shooting, that is, an operation of stopping the communication, is detected, the communication is stopped directly without generating a recorded video. It is understandable that, in some real-time communication scenarios, the user may also choose to save the recorded video.
- the electronic device determines whether to save the recorded video in the real-time communication scene in response to the user's operation.
- the recorded video further includes third audio
- the third audio is audio determined according to the second sound pickup range
- the second sound pickup range is determined according to the first sound pickup range, and is different from the first sound pickup range.
- the video playback interface includes a first control and a second control, the first control corresponds to the second audio, and the second control corresponds to the third audio.
- the electronic device may determine one or more reference first sound pickup ranges in the vicinity of the first sound pickup range. The electronic device obtains one channel of audio according to the first sound pickup range and at least one channel of audio according to the reference first sound pickup ranges, and may also use the panoramic audio as one channel of audio. In this way, the electronic device can obtain multiple channels of audio corresponding to the first face image or the first human mouth image based on the first sound pickup range. Here, one channel of audio can be understood as one audio file.
- the video recording function may include a single-channel video recording function and a multi-channel video recording function.
- the single-channel video recording function means that the electronic device displays one viewfinder frame during shooting, which is used to record one channel of video picture.
- the multi-channel video recording function means that the electronic device displays at least two viewfinder frames during shooting, and each viewfinder frame is used to record one channel of video picture.
- for each channel of video picture and the corresponding audio collection method, reference may be made to the implementation of the single-channel video recording function.
- the electronic device can switch between and play the audio corresponding to different sound pickup ranges, which gives the user multiple audio playback options, makes the audio adjustable, and improves the user's audio playback experience.
- the method further includes: in response to the fourth operation, playing the video picture and the second audio.
- the fourth operation includes an operation of operating a playback control or an operation of operating the first control.
- a fifth operation of operating the second control is detected.
- the video picture and the third audio are played.
- the electronic device may display a video playback interface without playing audio first. After detecting the user's instruction operation, the electronic device plays the audio indicated by the user.
- the method further includes: deleting the second audio or the third audio in response to the operation of deleting the second audio or the third audio.
- the audio that the user does not want to save can be deleted according to the user's requirements, thereby improving the user experience.
- the present application provides an electronic device comprising a processor, a memory, a microphone, a camera, and a display screen, where the memory, the microphone, the camera, and the display screen are coupled to the processor, and the memory is used for storing computer program code.
- the computer program code includes computer instructions that, when read by the processor from the memory, cause the electronic device to perform the following operations: a first operation of opening the camera application is detected.
- a shooting preview interface is displayed.
- a second operation to start recording is detected.
- the video picture and the first audio are collected, and a shooting interface is displayed, and the shooting interface includes a preview interface of the video picture.
- the target image is the first face image and/or the first human mouth image, where the first face image is the face image of the sounding object in the video picture, and the first human mouth image is the human mouth image of the sounding object in the video picture.
- the first sound pickup range corresponding to the sounding object is determined.
- the second audio corresponding to the video picture is obtained, and the audio volume within the first sound pickup range in the second audio is greater than the audio volume outside the first sound pickup range.
- determining the first sound pickup range corresponding to the sounding object according to the target image includes: obtaining a first feature value according to the target image, where the first feature value includes one or more of a front/rear attribute parameter, an area ratio, and position information. The front/rear attribute parameter indicates whether the video picture is captured by the front camera or by the rear camera; the area ratio indicates the ratio of the area of the target image to the area of the video picture; and the position information indicates the position of the target image in the video picture. The first sound pickup range corresponding to the sounding object is then determined according to the first feature value.
- determining the first sound pickup range corresponding to the sounding object according to the first feature value includes: when the video picture is a front video picture, determining that the first sound pickup range is the sound pickup range on the front camera side; and when the video picture is a rear video picture, determining that the first sound pickup range is the sound pickup range on the rear camera side.
- determining the first sound pickup range corresponding to the sounding object according to the first feature value includes: determining the first sound pickup range according to the area ratio and the sound pickup range of the first audio.
- determining the first sound pickup range corresponding to the sounding object according to the first feature value includes: determining the position of the first sound pickup range in the sound pickup range of the first audio according to the position information.
- the position information includes a first offset of the center point of the target image relative to a first reference point, where the first reference point is the center point or the focus of the video picture.
- determining the position of the first sound pickup range in the sound pickup range of the first audio according to the position information includes: determining, according to the first offset, a second offset of the center point of the first sound pickup range relative to the center point of the sound pickup range of the first audio, where the second offset is proportional to the first offset; and determining, according to the second offset, the position of the first sound pickup range in the sound pickup range of the first audio.
- the center point of the video picture is the center point of the viewfinder frame, or the center point of the video picture is the center point of the display screen.
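How the three components of the first feature value could combine into a first sound pickup range is sketched below in Python. This is only an illustration under stated assumptions: the pickup range is modeled as a horizontal angular interval, the width is assumed to scale linearly with the area ratio, and the proportionality constant between the first and second offsets is assumed to be the full angular width divided by the picture width. None of these numeric choices, nor the names `PickupRange` and `first_pickup_range`, come from the application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PickupRange:
    side: str          # "front" or "rear", from the front/rear attribute parameter
    center_deg: float  # second offset: center relative to the first audio's range center
    width_deg: float   # angular width of the first sound pickup range


def first_pickup_range(is_front: bool,
                       area_ratio: float,
                       first_offset_px: float,
                       picture_width_px: float,
                       full_width_deg: float = 180.0,
                       min_width_deg: float = 10.0) -> PickupRange:
    """Derive a first sound pickup range from a first feature value (illustrative)."""
    if not 0.0 < area_ratio <= 1.0:
        raise ValueError("area_ratio must be in (0, 1]")
    if picture_width_px <= 0:
        raise ValueError("picture_width_px must be positive")

    # Front/rear attribute parameter selects the side of the device.
    side = "front" if is_front else "rear"

    # Assumption: the range width grows linearly with the area ratio.
    width = max(min_width_deg, full_width_deg * area_ratio)

    # Second offset is proportional to the first offset; k is an assumed constant.
    k = full_width_deg / picture_width_px
    second_offset = k * first_offset_px

    # Keep the first pickup range inside the pickup range of the first audio.
    limit = (full_width_deg - width) / 2.0
    center = max(-limit, min(limit, second_offset))
    return PickupRange(side=side, center_deg=center, width_deg=width)
```

For example, a target image that occupies a quarter of a 1000-pixel-wide front video picture and sits 100 pixels left of center yields a 45-degree range on the front side, centered about 18 degrees to the left.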
- obtaining the second audio corresponding to the video picture according to the first sound pickup range and the first audio includes: enhancing the audio signal within the first sound pickup range in the first audio, and/or attenuating the audio signal outside the first sound pickup range in the first audio, to obtain the second audio.
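The enhance-and-attenuate step can be pictured with a small Python sketch. It assumes the first audio has already been separated into components tagged with an arrival direction (for example by a beamformer, which is not shown), and it uses fixed gains chosen only for illustration. The function name `second_audio` and its parameters are hypothetical.

```python
from typing import List, Sequence, Tuple


def second_audio(components: Sequence[Tuple[float, Sequence[float]]],
                 range_center_deg: float,
                 range_width_deg: float,
                 boost: float = 2.0,
                 cut: float = 0.5) -> List[float]:
    """Mix direction-tagged components, boosting those inside the pickup range.

    components: (direction in degrees, samples) pairs of equal length.
    """
    if not components:
        raise ValueError("at least one component is required")
    length = len(components[0][1])
    half_width = range_width_deg / 2.0
    mixed = [0.0] * length
    for direction_deg, samples in components:
        if len(samples) != length:
            raise ValueError("all components must have the same length")
        inside = abs(direction_deg - range_center_deg) <= half_width
        gain = boost if inside else cut  # enhance inside, attenuate outside
        for i, sample in enumerate(samples):
            mixed[i] += gain * sample
    return mixed
```

Setting `boost` to 1.0 gives attenuation only, and setting `cut` to 1.0 gives enhancement only, which mirrors the "and/or" wording of the step.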
- the electronic device includes one or more microphones, and the one or more microphones are used to collect the first audio.
- obtaining the second audio corresponding to the video picture according to the first sound pickup range and the first audio includes: when the sound pickup range of a first microphone among the one or more microphones includes part or all of the first sound pickup range, performing at least one of the following operations to obtain the second audio: enhancing the audio signal within the first sound pickup range in the sound pickup range of the first microphone; attenuating the audio signal outside the first sound pickup range in the sound pickup range of the first microphone; and attenuating the audio signals of the microphones other than the first microphone among the one or more microphones.
- the electronic device includes at least two microphones, and the at least two microphones are used to collect the first audio.
- obtaining the second audio corresponding to the video picture according to the first sound pickup range and the first audio includes: when the sound pickup range of a second microphone among the at least two microphones does not include the first sound pickup range, turning off the second microphone, where the audio collected by the microphones other than the second microphone among the at least two microphones is the second audio.
- when the second microphone is turned off, the computer instructions, when read by the processor from the memory, further cause the electronic device to perform the following operation: enhancing the audio signals within the first sound pickup range in the sound pickup ranges of the microphones other than the second microphone among the at least two microphones, and/or attenuating the audio signals outside the first sound pickup range in those sound pickup ranges.
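The per-microphone decision described above can be sketched as follows. The sketch assumes each microphone's sound pickup range and the first sound pickup range are given as angular intervals in degrees; the interval model, the action labels, and the function names are assumptions made for illustration.

```python
from typing import Dict, Tuple

Interval = Tuple[float, float]  # (start_deg, end_deg), start < end


def overlaps(a: Interval, b: Interval) -> bool:
    """True when the two open intervals share any angle."""
    return a[0] < b[1] and b[0] < a[1]


def plan_microphones(mic_ranges: Dict[str, Interval],
                     first_range: Interval) -> Dict[str, str]:
    """Decide, per microphone, whether to turn it off or keep and process it."""
    plan: Dict[str, str] = {}
    for name, mic_range in mic_ranges.items():
        if overlaps(mic_range, first_range):
            # Keep the microphone; its in-range signal can be enhanced and
            # its out-of-range signal attenuated.
            plan[name] = "enhance_in_range"
        else:
            # The microphone's range does not include the first pickup range.
            plan[name] = "turn_off"
    if all(action == "turn_off" for action in plan.values()):
        raise ValueError("no microphone covers the first sound pickup range")
    return plan
```

With two microphones covering 0 to 120 degrees and 120 to 240 degrees and a first sound pickup range of 30 to 60 degrees, the plan keeps the first microphone and turns off the second, matching the microphone 1 / microphone 2 example given earlier in the description.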
- the number of first face images is one or more, and the number of first human mouth images is one or more.
- when the processor reads the computer instructions from the memory, the electronic device is further caused to perform the following operations: a third operation of stopping shooting is detected; in response to the third operation, recording is stopped and a recorded video is generated, where the recorded video includes the video picture and the second audio; a fourth operation of playing the recorded video is detected; and in response to the fourth operation, a video playback interface is displayed and the video picture and the second audio are played.
- the recorded video further includes third audio
- the third audio is audio determined according to the second sound pickup range
- the second sound pickup range is determined according to the first sound pickup range, and is different from the first sound pickup range.
- the video playback interface includes a first control and a second control, the first control corresponds to the second audio, and the second control corresponds to the third audio.
- when the processor reads the computer instructions from the memory, the electronic device is further caused to perform the following operations.
- the video picture and the second audio are played; the fourth operation includes an operation of operating a playback control or an operation of operating the first control.
- a fifth operation of operating the second control is detected.
- the video picture and the third audio are played.
- when the processor reads the computer instructions from the memory, the electronic device is further caused to perform the following operation: in response to an operation of deleting the second audio or the third audio, deleting the second audio or the third audio.
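The relationship between the controls on the video playback interface and the audio stored in the recorded video can be modeled with a short Python sketch. The class names, control identifiers, and file names are hypothetical; the sketch only mirrors the behavior described above (the playback control or the first control plays the second audio, the second control plays the third audio, and either audio can be deleted).

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class RecordedVideo:
    picture: str
    # Maps a control on the playback interface to the audio it plays.
    audios: Dict[str, str] = field(default_factory=dict)


class PlaybackInterface:
    def __init__(self, video: RecordedVideo) -> None:
        self.video = video
        self.playing: Optional[str] = None

    def operate(self, control: str) -> str:
        """Play the audio bound to a control; 'play' behaves like the first control."""
        if control == "play":
            control = "first_control"
        if control not in self.video.audios:
            raise KeyError(f"no audio bound to {control!r}")
        self.playing = self.video.audios[control]
        return self.playing

    def delete_audio(self, control: str) -> None:
        """Delete the audio bound to a control, stopping it if it is playing."""
        removed = self.video.audios.pop(control)
        if self.playing == removed:
            self.playing = None
```

Keeping each audio as a separately addressable entry is what allows switching between sound pickup ranges during playback without re-recording.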
- when the processor reads the computer instructions from the memory, the electronic device is further caused to perform the following operations: a sixth operation of starting the speech enhancement mode is detected; and in response to the sixth operation, the speech enhancement mode is started.
- the present application provides an electronic device having the function of implementing the audio processing method described in the first aspect and any of the possible implementation manners.
- this function can be implemented by hardware, or by hardware executing corresponding software.
- the hardware or software includes one or more modules corresponding to the above functions.
- the present application provides a computer-readable storage medium including computer instructions which, when run on an electronic device, cause the electronic device to perform the audio processing method described in the first aspect and any one of its possible implementations.
- the present application provides a computer program product which, when run on an electronic device, causes the electronic device to perform the audio processing method described in the first aspect and any one of its possible implementations.
- in a sixth aspect, a circuit system is provided; the circuit system includes a processing circuit configured to perform the audio processing method described in the first aspect and any one of its possible implementations.
- an embodiment of the present application provides a chip system, including at least one processor and at least one interface circuit, where the at least one interface circuit is configured to perform a transceiving function and send instructions to the at least one processor.
- the at least one processor executes the audio processing method described in the first aspect and any one of the possible implementations.
- FIG. 1 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
- FIG. 2A is a schematic layout diagram of a camera provided by an embodiment of the present application.
- FIG. 2B is a schematic layout diagram of a microphone according to an embodiment of the present application.
- FIG. 3 is a schematic block diagram of a software structure of an electronic device provided by an embodiment of the present application.
- FIG. 4 is a first schematic diagram of a group of interfaces provided by an embodiment of the present application.
- FIG. 5 is a first schematic diagram of a sound pickup range provided by an embodiment of the present application.
- FIG. 6 is a first schematic flowchart of an audio processing method provided by an embodiment of the present application.
- FIG. 7 is a first schematic diagram of an interface provided by an embodiment of the present application.
- FIG. 8 is a second schematic diagram of a group of interfaces provided by an embodiment of the present application.
- FIG. 9 is a second schematic diagram of a sound pickup range provided by an embodiment of the present application.
- FIG. 10 is a third schematic diagram of a group of interfaces provided by an embodiment of the present application.
- FIG. 11 is a fourth schematic diagram of a group of interfaces provided by an embodiment of the present application.
- FIG. 12 is a fifth schematic diagram of a group of interfaces provided by an embodiment of the present application.
- FIG. 13 is a schematic diagram of a coordinate system provided by an embodiment of the present application.
- FIG. 14 is a schematic diagram of an offset angle provided by an embodiment of the present application.
- FIG. 15 is a schematic diagram of an offset distance provided by an embodiment of the present application.
- FIG. 16A is a first schematic diagram of a first sound pickup range provided by an embodiment of the present application.
- FIG. 16B is a second schematic diagram of a first sound pickup range provided by an embodiment of the present application.
- FIG. 16C is a third schematic diagram of a first sound pickup range provided by an embodiment of the present application.
- FIG. 17 is a second schematic diagram of an interface provided by an embodiment of the present application.
- FIG. 18 is a sixth schematic diagram of a group of interfaces provided by an embodiment of the present application.
- FIG. 19 is a seventh schematic diagram of a group of interfaces provided by an embodiment of the present application.
- FIG. 20 is an eighth schematic diagram of a group of interfaces provided by an embodiment of the present application.
- FIG. 21 is a second schematic flowchart of an audio processing method provided by an embodiment of the present application.
- the electronic device may specifically be a mobile phone, a tablet computer, a wearable device, a vehicle-mounted device, an augmented reality (AR)/virtual reality (VR) device, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, a personal digital assistant (PDA), an artificial intelligence (AI) device, or a dedicated camera (for example, a single-lens reflex camera or a compact camera), etc.
- FIG. 1 shows a schematic structural diagram of an electronic device 100 .
- the electronic device 100 may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, a headphone jack 170D, a sensor module 180, buttons 190, a motor 191, an indicator 192, a camera 193, a display screen 194, a subscriber identification module (SIM) card interface 195, and so on.
- the processor 110 may include one or more processing units. For example, the processor 110 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a memory, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), etc. Different processing units may be independent devices, or may be integrated in one or more processors.
- the controller may be the nerve center and command center of the electronic device 100 .
- the controller can generate an operation control signal according to the instruction operation code and timing signal, and complete the control of fetching and executing instructions.
- a memory may also be provided in the processor 110 for storing instructions and data.
- the memory in the processor 110 is a cache memory. This memory may hold instructions or data that the processor 110 has just used or uses cyclically. If the processor 110 needs to use the instructions or data again, they can be fetched directly from this memory. This avoids repeated accesses and reduces the waiting time of the processor 110, thereby improving the efficiency of the system.
- the processor 110 performs image recognition on multiple frames of images collected in a video picture, and obtains face image and/or human mouth image data contained in each frame of image.
- the sounding face and/or the sounding human mouth in each frame of image is determined.
- information such as the position and proportion of the sounding face and/or sounding human mouth in the video picture is determined.
- the sound pickup range to be enhanced is determined according to information such as the position and proportion of the sounding face and/or sounding human mouth in the video picture, that is, the position area of the sounding person's voice in the panoramic audio is determined.
- the audio quality in the recorded video is improved by enhancing the audio signal within the sound pickup range.
- audio signals outside the sound pickup range are further attenuated to reduce the interference of external noise.
- the charging management module 140 is used to receive charging input from the charger.
- the power management module 141 is used for connecting the battery 142 , the charging management module 140 and the processor 110 .
- the power management module 141 receives input from the battery 142 and/or the charge management module 140, and supplies power to the processor 110, the display screen 194, the camera 193, and the like.
- the wireless communication function of the electronic device 100 may be implemented by the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, the modem processor, the baseband processor, and the like.
- the mobile communication module 150 may provide wireless communication solutions including 2G/3G/4G/5G etc. applied on the electronic device 100 .
- the wireless communication module 160 can provide wireless communication solutions applied on the electronic device 100, including wireless local area networks (WLAN) (such as wireless fidelity (Wi-Fi) networks), Bluetooth (BT), etc.
- the electronic device 100 implements a display function through a GPU, a display screen 194, an application processor, and the like.
- the GPU is a microprocessor for image processing, and is connected to the display screen 194 and the application processor.
- the GPU is used to perform mathematical and geometric calculations for graphics rendering.
- Processor 110 may include one or more GPUs that execute program instructions to generate or alter display information.
- Display screen 194 is used to display images, videos, and the like.
- Display screen 194 includes a display panel.
- the electronic device 100 may include one or N display screens 194 , where N is a positive integer greater than one.
- the display screen 194 can display a shooting preview interface, a video preview interface and a shooting interface in the video recording mode, and can also display a video playing interface and the like during video playback.
- the electronic device 100 may implement a shooting function through an ISP, a camera 193, a video codec, a GPU, a display screen 194, an application processor, and the like.
- the ISP is used to process the data fed back by the camera 193 .
- the shutter is opened, the light is transmitted to the camera photosensitive element through the lens, the light signal is converted into an electrical signal, and the camera photosensitive element transmits the electrical signal to the ISP for processing, converting it into an image visible to the naked eye.
- ISP can also perform algorithm optimization on image noise, brightness, and skin tone.
- ISP can also optimize the exposure, color temperature and other parameters of the shooting scene.
- the ISP may be provided in the camera 193 .
- the ISP may control the photosensitive element to perform exposure and photographing according to the photographing parameters.
- Camera 193 is used to capture still images or video.
- an optical image of the object is generated through the lens and projected onto the photosensitive element.
- the photosensitive element may be a charge coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor.
- the photosensitive element converts the optical signal into an electrical signal, and then transmits the electrical signal to the ISP to convert it into a digital image signal.
- the ISP outputs the digital image signal to the DSP for processing.
- DSP converts digital image signals into standard RGB, YUV and other formats of image signals.
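As a worked example of such a format conversion, the following Python sketch converts one RGB pixel to YUV using the analog BT.601 coefficients. The description does not say which YUV variant or matrix the DSP uses, so the choice of BT.601 and the function name `rgb_to_yuv` are assumptions made for illustration.

```python
from typing import Tuple


def rgb_to_yuv(r: float, g: float, b: float) -> Tuple[float, float, float]:
    """Convert normalized RGB (each in 0..1) to YUV with BT.601 coefficients."""
    # Luma is a weighted sum of the three color components.
    y = 0.299 * r + 0.587 * g + 0.114 * b
    # Chroma components are scaled color differences.
    u = 0.492 * (b - y)
    v = 0.877 * (r - y)
    return (y, u, v)
```

A neutral gray or white pixel has equal R, G, and B, so both chroma components come out as approximately zero and only the luma carries information.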
- the electronic device 100 may include 1 or N cameras 193 , where N is a positive integer greater than 1.
- the camera 193 may be located in the edge area of the electronic device, may be an under-screen camera, or may be a camera that can be raised and lowered.
- the camera 193 may include a rear camera, and may also include a front camera. The embodiment of the present application does not limit the specific position and shape of the camera 193 .
- for the layout of the cameras on the electronic device 100, refer to FIG. 2A, where the front of the electronic device 100 is the plane in which the display screen 194 is located.
- the camera 1931 is located on the front of the electronic device 100 , and the camera is a front-facing camera.
- the camera 1932 is located on the back of the electronic device 100 , and the camera is a rear camera.
- the solutions of the embodiments of the present application may be applied to the electronic device 100 having a folding screen with multiple display screens (that is, the display screen 194 can be folded).
- the folding screen electronic device 100 as shown in (c) of FIG. 2A .
- the display screen is folded inward (or outwardly) along the folded edge, so that the display screen forms at least two screens (eg, A screen and B screen).
- the camera on the C-screen is on the back of the electronic device 100, which can be regarded as a rear camera.
- the camera on the C-screen becomes on the front of the electronic device 100, which can be regarded as a front-facing camera. That is to say, the front camera and the rear camera in this application do not limit the nature of the cameras themselves, but are only an illustration of a positional relationship.
- the electronic device 100 can determine whether the camera in use is a front camera or a rear camera according to its position on the electronic device 100, and then determine the direction of sound collection. For example, if the electronic device 100 currently collects images through a rear camera located on the back of the electronic device 100, the electronic device 100 needs to focus on collecting sound from the back of the electronic device 100. For another example, if the electronic device 100 currently collects images through a front camera located on the front of the electronic device 100, the electronic device 100 needs to focus on collecting sound from the front of the electronic device 100. This ensures that the collected sound matches the captured image.
- the digital signal processor is used to process digital signals; in addition to digital image signals, it can also process other digital signals. For example, when the electronic device 100 selects a frequency point, the digital signal processor is used to perform a Fourier transform on the frequency point energy, and so on.
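The frequency point computation mentioned above can be illustrated with a single-bin discrete Fourier transform in Python. This is a generic textbook formulation, not the DSP's actual routine; the function name `bin_energy` and the choice of squared magnitude as "energy" are assumptions.

```python
import cmath
import math
from typing import Sequence


def bin_energy(samples: Sequence[float], k: int) -> float:
    """Energy (squared magnitude) of DFT bin k for a block of samples."""
    n = len(samples)
    if n == 0 or not 0 <= k < n:
        raise ValueError("k must index a bin of a non-empty block")
    # X[k] = sum over i of x[i] * exp(-2*pi*j*k*i/n)
    x_k = sum(s * cmath.exp(-2j * math.pi * k * i / n)
              for i, s in enumerate(samples))
    return abs(x_k) ** 2
```

Evaluating one bin this way costs a number of operations proportional to the block length, which is cheaper than a full transform when only a single frequency point is of interest.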
- Video codecs are used to compress or decompress digital video.
- the electronic device 100 may support one or more video codecs.
- the electronic device 100 can play or record videos in multiple encoding formats, for example, moving picture experts group (MPEG) 1, MPEG2, MPEG3, and MPEG4.
- the NPU is a neural-network (NN) computing processor.
- Applications such as intelligent cognition of the electronic device 100 can be implemented through the NPU, such as image recognition, face recognition, speech recognition, text understanding, and the like.
- the NPU uses an image recognition technology to recognize whether the image captured by the camera 193 includes a face image and/or a human mouth image. Further, the NPU can also confirm the voice-producing face or the voice-producing mouth according to the data of the face image and/or the mouth image, so as to confirm the sound pickup range that needs to perform directional recording.
- the external memory interface 120 can be used to connect an external memory card, such as a Micro SD card, to expand the storage capacity of the electronic device 100 .
- the external memory card communicates with the processor 110 through the external memory interface 120 to implement the data storage function. For example, files such as music and videos are saved in the external memory card.
- Internal memory 121 may be used to store computer executable program code, which includes instructions.
- the processor 110 executes various functional applications and data processing of the electronic device 100 by executing instructions stored in the internal memory 121 and/or instructions stored in a memory provided in the processor.
- the electronic device 100 may implement audio functions, such as music playback and recording, through the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the earphone interface 170D, the application processor, and the like.
- the audio module 170 is used for converting digital audio data into analog audio electrical signal output, and also for converting analog audio electrical signal input into digital audio data.
- the audio module 170 may include an analog/digital converter and a digital/analog converter.
- the audio module 170 is used to convert the analog audio electrical signal output by the microphone 170C into digital audio data.
- Audio module 170 may also be used to encode and decode audio data.
- the audio module 170 may be provided in the processor 110 , or some functional modules of the audio module 170 may be provided in the processor 110 .
- the speaker 170A, also referred to as a "loudspeaker", is used to convert analog audio electrical signals into sound signals.
- the user can listen to music or a hands-free call through the speaker 170A of the electronic device 100.
- the receiver 170B, also referred to as an "earpiece", is used to convert analog audio electrical signals into sound signals.
- when the electronic device 100 answers a call or a voice message, the voice can be heard by placing the receiver 170B close to the human ear.
- the microphone 170C, also called a "mic", is used to convert sound signals into analog audio electrical signals.
- the user can speak with the mouth close to the microphone 170C to input the sound signal into the microphone 170C.
- the microphone 170C may be a built-in component of the electronic device 100 or an external accessory of the electronic device 100 .
- the electronic device 100 may include one or more microphones 170C, where each microphone works alone or multiple microphones cooperate to collect sound signals from various directions and convert the collected sound signals into analog audio electrical signals. The microphones can also implement noise reduction, sound source identification, or directional recording.
- FIG. 2B exemplarily shows the layout of multiple microphones on two types of electronic devices 100 and the sound pickup range corresponding to each microphone.
- the front of the electronic device 100 is the plane in which the display screen 194 is located; the microphone 21 is located at the top of the electronic device 100 (usually the side where the earpiece and the camera are located), the microphone 22 is located on the right side of the electronic device 100, and the microphone 23 is located at the bottom of the electronic device 100 (the bottom is not visible at the angle shown in (a) of FIG. 2B, so the position of the microphone 23 is schematically indicated by a dotted line).
- the sound pickup range corresponding to the microphone 21 includes the front upper sound pickup range and the rear upper sound pickup range
- the sound pickup range corresponding to the microphone 22 includes the front middle pickup range
- the pickup range corresponding to the microphone 23 includes the front lower pickup range and the rear lower pickup range.
- the combination of the microphones 21 - 23 can collect sound signals from all directions around the electronic device 100 .
- the front camera may correspond to the front sound pickup range
- the rear camera may correspond to the rear sound pickup range. Then, when the electronic device 100 uses the front camera to record a video, the sound pickup range is determined to be the front sound pickup range.
- the sound pickup range is more accurately determined to be a certain range included in the front sound pickup range. The specific method is described in detail below.
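The correspondence between cameras, microphones, and sound pickup ranges described above can be sketched as a small lookup. This is only an illustration: the dictionary representation and the function name `microphones_for_camera` are hypothetical, and only the three-microphone layout of (a) in FIG. 2B is encoded.

```python
from typing import Dict, Set

# Pickup ranges per microphone, following the three-microphone layout in the text.
# The dictionary form itself is an assumption made for illustration.
MIC_PICKUP_RANGES: Dict[int, Set[str]] = {
    21: {"front upper", "rear upper"},
    22: {"front middle"},
    23: {"front lower", "rear lower"},
}


def microphones_for_camera(camera: str) -> Dict[int, Set[str]]:
    """Return, per microphone, the pickup ranges on the side the given camera faces."""
    if camera not in ("front", "rear"):
        raise ValueError("camera must be 'front' or 'rear'")
    selected: Dict[int, Set[str]] = {}
    for mic, ranges in MIC_PICKUP_RANGES.items():
        # Keep only the ranges whose name starts with the camera side.
        matching = {r for r in ranges if r.startswith(camera)}
        if matching:
            selected[mic] = matching
    return selected


if __name__ == "__main__":
    print(microphones_for_camera("front"))
```

With this layout, recording with the front camera involves all three microphones, while the rear camera involves only the microphones 21 and 23.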
- the electronic device 100 may further include a larger number of microphones, as shown in (c) of FIG. 2B , the electronic device 100 includes 6 microphones.
- the microphone 24 is located on the top of the electronic device 100
- the microphone 25 is located on the left side of the electronic device 100
- the microphone 26 is located at the bottom of the electronic device 100
- the microphones 27 - 29 are located on the right side of the electronic device 100 .
- the left part of the electronic device 100 shown in (c) of FIG. 2B is not visible at the current angle, and the positions of the microphone 25 and the microphone 26 are schematically indicated by dotted lines. As shown in FIG.
- the sound pickup range corresponding to the microphone 24 includes the front upper sound pickup range
- the sound pickup range corresponding to the microphone 25 includes the front middle sound pickup range
- the sound pickup range corresponding to the microphone 26 includes the front lower sound pickup range
- the sound pickup range corresponding to the microphone 27 includes the rear upper sound pickup range
- the sound pickup range corresponding to the microphone 28 includes the rear middle pickup range
- the sound pickup range corresponding to the microphone 29 includes the rear lower sound pickup range.
- the combination of microphones 24 - 29 can collect sound signals from all directions around the electronic device 100 .
- the pickup ranges of the audio signals collected by the microphones of the electronic device 100 partially overlap, that is, the shaded parts in (b) and (d) of FIG. 2B .
- the sound quality of the sound signal collected by a certain microphone may be better (for example, the signal-to-noise ratio is high, and there is relatively little spike noise and glitch noise), while the sound quality of the sound signal collected by another microphone may be poor.
- the audio data with better sound quality in the corresponding direction is selected for fusion processing, and audio with a better effect is recorded and generated from the processed audio data.
- the audio data collected by the multiple microphones can be fused to obtain the audio corresponding to the uttering face or the uttering mouth.
- the microphone 170C can be a directional microphone, which can collect sound signals in a specific direction.
- the microphone 170C can also be a non-directional microphone, which can collect sound signals in various directions, or can collect sound signals within a certain range according to its position on the electronic device 100 .
- the microphone 170C is rotatable, and the electronic device 100 can adjust the sound pickup direction by rotating the microphone.
- the electronic device 100 may be configured with one microphone 170C, and by rotating the microphone 170C the microphone can pick up sound in all directions.
- the audio signals within the corresponding sound pickup range can be picked up by the combination of different microphones 170C.
- some of the microphones 170C can be used to pick up sound without using all the microphones 170C of the electronic device 100 .
- the audio signals collected by some microphones 170C are enhanced, and the audio signals collected by some microphones 170C are attenuated.
- This embodiment of the present application does not specifically limit the number of microphones 170C.
- the sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, etc.
- the distance sensor 180F is used to measure the distance.
- the electronic device 100 can measure the distance through infrared or laser. In some embodiments, when shooting a scene, the electronic device 100 can use the distance sensor 180F to measure the distance to achieve fast focusing.
- Touch sensor 180K is also called a “touch panel”.
- the touch sensor 180K may be disposed on the display screen 194 , and the touch sensor 180K and the display screen 194 together form a touch screen, also called a “touch-controlled screen”.
- the touch sensor 180K is used to detect a touch operation on or near it.
- the electronic device 100 may detect the operation of the user instructing to start and/or stop recording through the touch sensor 180K.
- the structures illustrated in the embodiments of the present application do not constitute a specific limitation on the electronic device 100 .
- the electronic device 100 may include more or fewer components than shown, or combine some components, or separate some components, or arrange the components differently.
- the illustrated components may be implemented in hardware, software or a combination of software and hardware.
- the software system of the electronic device 100 may adopt a layered architecture, an event-driven architecture, a microkernel architecture, a microservice architecture, or a cloud architecture.
- the embodiment of the present invention takes an Android system with a layered architecture as an example to illustrate the software structure of the electronic device 100.
- FIG. 3 is a block diagram of a software structure of an electronic device 100 according to an embodiment of the present invention.
- the layered architecture divides the software into several layers, and each layer has a clear role and division of labor. Layers communicate with each other through software interfaces.
- the operating system (such as the Android system) of the electronic device is divided into four layers, which are, from bottom to top, a kernel layer, a hardware abstraction layer (HAL), an application framework layer, and an application layer.
- the kernel layer is the layer between hardware and software.
- the kernel layer contains at least camera drivers, audio drivers, display drivers, and sensor drivers.
- the touch sensor 180K transmits the received touch operation to the upper-layer camera application through the sensor driver of the kernel layer.
- the camera application recognizes that the touch operation is an operation to start recording video
- the camera application invokes the camera 193 through the camera driver to record video images, and invokes the microphone 170C through the audio driver to record audio.
- the corresponding hardware interrupt is sent to the kernel layer, and the kernel layer can process the corresponding operation into an original input event (for example, the touch operation includes touch coordinates, time stamp of the touch operation and other information).
- Raw input events are stored at the kernel layer.
- the hardware abstraction layer is located between the kernel layer and the application framework layer. It defines the interface through which the driver hardware implementations are exposed to application programs, and converts the values produced by the driver hardware implementations into a form usable by the software programming language. For example, it identifies a value from the camera driver, converts it into the software programming language, uploads it to the application framework layer, and the camera service system is then called.
- the HAL can upload the video images collected by the camera 193 and the raw data after face image recognition to the application framework layer for further processing.
- the original data after face image recognition may include, for example, face image data and/or human mouth image data, and the like.
- the face image data may include the number of voiced face images, the position information of the voiced face images in the video screen, etc.
- the human mouth image data may include the number of voiced human mouth images, the position information of the voiced human mouth images in the video screen, etc.
- the priority order of the face image data and the human mouth image data is preset.
- the human vocal organ is the human mouth
- the sound pickup range can therefore be determined more accurately from the voiced human mouth data, so the human mouth image data is given a higher priority than the face image data.
- HAL can determine the voiced face image data and voiced human mouth image data according to the collected video images, and upload the voiced human mouth data as raw data according to the priority order.
- the subsequent audio processing system determines the sound pickup range corresponding to the uttering mouth image according to the corresponding relationship between the video picture and the panoramic audio based on the uttering mouth image data.
- the HAL only determines the sounding face image data in the collected video images, and uploads the sounding face image data as raw data to determine the sound pickup range corresponding to the sounding face image.
- the HAL only determines the uttering mouth image data according to the video picture, and uploads the uttering mouth image data as raw data to determine the sound pickup range corresponding to the uttering mouth image.
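The preset priority order between the two kinds of recognition data can be sketched as follows. This is a minimal illustration: the function name `select_raw_data`, the dictionary fields, and the `kind` tag are hypothetical and do not come from the application.

```python
from typing import Optional


def select_raw_data(face_data: Optional[dict], mouth_data: Optional[dict]) -> Optional[dict]:
    """Pick the data to upload as raw data, preferring voiced-mouth data over voiced-face data."""
    if mouth_data:
        # Mouth data has the higher preset priority.
        return {"kind": "mouth", **mouth_data}
    if face_data:
        # Fall back to face data when no voiced mouth was determined.
        return {"kind": "face", **face_data}
    return None


if __name__ == "__main__":
    face = {"count": 1, "positions": [(0.5, 0.4)]}
    mouth = {"count": 1, "positions": [(0.5, 0.55)]}
    print(select_raw_data(face, mouth))
    print(select_raw_data(face, None))
```

The fallback branches correspond to the cases in which only face data or only mouth data is determined from the video picture.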
- the application framework layer provides an application programming interface (application programming interface, API) and a programming framework for applications in the application layer.
- the application framework layer obtains the original input event from the kernel layer via the HAL, and identifies the control corresponding to the input event.
- the application framework layer includes some predefined functions.
- the application framework layer may include a camera service system, an audio processing system, a view system, a telephony manager, a resource manager, a notification manager, a window manager, and the like.
- the camera service system serves the camera application and is used to call the camera application to collect images based on the raw events input from the kernel layer.
- the audio processing system is used to manage the audio data and process the audio data with different audio algorithms. For example, in cooperation with the camera service system, the collected audio signals are processed during the recording process. For example, based on the face image data, the sound pickup range is determined, the audio signals within the sound pickup range are enhanced, and the audio signals outside the sound pickup range are weakened.
- the camera application invokes the camera service system of the application framework layer to start the camera application. The camera driver is then started by calling the kernel layer, and video is captured through the camera 193 . The audio processing system is also called: the audio driver is started through the kernel layer, sound signals are collected through the microphone 170C to generate analog audio electrical signals, the audio module 170 generates digital audio data from the analog audio electrical signals, and audio is generated from the digital audio data.
- the view system includes visual controls, such as controls for displaying text, controls for displaying pictures, and so on. View systems can be used to build applications.
- a display interface can consist of one or more views.
- the display interface including the short message notification icon may include a view for displaying text and a view for displaying pictures.
- the phone manager is used to provide the communication function of the electronic device 100 .
- for example, the management of call status (including connecting, hanging up, etc.).
- the resource manager provides various resources for the application, such as localization strings, icons, pictures, layout files, video files and so on.
- the notification manager enables applications to display notification information in the status bar, which can be used to convey notification-type messages, and can disappear automatically after a brief pause without user interaction. For example, the notification manager is used to notify download completion, message reminders, etc.
- the notification manager can also display notifications in the status bar at the top of the system in the form of graphs or scroll bar text, such as notifications of applications running in the background, and notifications on the screen in the form of dialog windows. For example, text information is prompted in the status bar, a prompt sound is issued, the electronic device vibrates, and the indicator light flashes.
- a window manager is used to manage window programs.
- the window manager can get the size of the display screen, determine whether there is a status bar, lock the screen, take screenshots, etc.
- the application layer can include a series of application packages.
- the application package can include applications such as camera, video, call, WLAN, music, short message, Bluetooth, map, calendar, gallery, navigation, etc.
- the application layer and the application framework layer run in virtual machines.
- the virtual machine executes the Java files of the application layer and the application framework layer as binary files.
- the virtual machine is used to perform functions such as object lifecycle management, stack management, thread management, security and exception management, and garbage collection.
- the audio processing method provided by the embodiment of the present application will be described below by taking the electronic device as a mobile phone having the structure shown in FIG. 1 and FIG. 3 as an example.
- the methods of the embodiments of the present application may be applied to a scenario in which a user instruction is received to directly start a camera application (hereinafter may also be referred to as a camera for short). It can also be applied to the scene where the user opens other third-party applications (such as short video applications, live broadcast applications, video calling applications, etc.) and invokes the startup of the camera.
- the user may instruct the mobile phone to start the camera and display the shooting preview interface through a touch operation, a key operation, an air gesture operation, or a voice operation.
- in response to the user clicking the camera icon 41 , the mobile phone starts the camera and displays the shooting preview interface 402 shown in (b) of FIG. 4 .
- the mobile phone starts the camera in response to the user's voice instruction operation to turn on the camera, and displays the shooting preview interface 402 shown in (b) of FIG. 4 .
- the control 421 is used to set the shooting function of the mobile phone, such as time-lapse shooting.
- Control 422 is used to turn the filter function on or off.
- Control 423 is used to turn the flash function on or off.
- the camera can switch between different functions in response to the user's operation of clicking different function controls.
- the controls 431-434 are used to switch among the functions that the camera can realize. If the control 432 is currently selected, the photographing function is activated. For another example, in response to the user clicking the control 431, the camera switches to the portrait shooting function. Alternatively, in response to the user clicking the control 433, the camera switches to the video recording function. Alternatively, in response to the user clicking the control 434, more switchable functions of the camera, such as panorama shooting, are displayed.
- the photographing function is turned on by default.
- the video recording function is activated, and the video preview interface is displayed.
- a shooting preview interface 402 as shown in (b) in FIG. 4 is displayed by default.
- after the mobile phone detects the user clicking the control 433, the video recording function is activated, and the video recording preview interface 403 shown in (c) of FIG. 4 is displayed.
- the mobile phone can also turn on the video recording function by default after starting the camera. For example, after the mobile phone starts the camera, the video recording preview interface 403 shown in (c) in FIG.
- after the mobile phone detects the user's operation of opening the camera application, the video recording function can be started.
- the mobile phone activates the video recording function by detecting an air gesture, or detecting a voice instruction operation. For example, when the mobile phone receives the user's voice command "open camera recording", it directly starts the recording function of the camera, and displays the recording preview interface.
- when the mobile phone starts the camera, it defaults to the function that was last in use before the camera was last closed, such as the portrait shooting function. After that, upon detecting an operation of enabling the video recording function, the mobile phone activates the video recording function of the camera and displays the video recording preview interface.
- after detecting a switch to the video recording function, the mobile phone first asks the user whether to enable the voice enhancement mode, and activates the voice enhancement mode after the user confirms. Alternatively, the mobile phone automatically activates the voice enhancement mode after detecting the switch to the video recording function. In still other embodiments, after detecting the switch to the video recording function, the mobile phone first displays the video recording preview interface, and then, after detecting the user's operation instructing it to shoot, activates the voice enhancement mode according to the user's instruction, or activates it automatically.
- in response to the user clicking the recording control 433, the mobile phone displays the recording preview interface 403 shown in (c) of FIG. 4, and displays in it a prompt box 44 asking the user whether to activate the voice enhancement mode. If it detects that the user clicks Yes, the mobile phone activates the voice enhancement mode and displays the shooting interface 404 shown in (d) of FIG. 4. Alternatively, after switching from the shooting preview interface 402 to the video recording function, the mobile phone directly activates the voice enhancement mode and displays the shooting interface 404 shown in (d) of FIG. 4.
- the mobile phone enables or disables the voice enhancement mode after detecting the user's operation of enabling or disabling the voice enhancement mode in the video recording preview interface or during the recording of the video screen.
- the operation of starting the voice enhancement mode may include, for example, an operation of clicking a preset control, a voice operation, and the like.
- the mobile phone can enable or disable the voice enhancement mode by detecting the user's operation on the control 46 .
- the current display state of the control 46 indicates that the voice enhancement mode is not activated on the current mobile phone.
- the voice enhancement mode is activated.
- the mobile phone can enable or disable the voice enhancement mode by detecting the user's operation on the control 46 before or during the shooting.
- after the voice enhancement mode is turned on, the mobile phone starts recording video images upon detecting the user's operation instructing it to shoot, and can perform video encoding and other processing on the captured video images to generate and save video files.
- in response to the user clicking the shooting control 45, the mobile phone displays the shooting interface 404 shown in (d) of FIG. 4 and starts recording the video picture.
- the voice enhancement mode is used to enhance the collection of the audio of certain specific objects in the recorded video picture, thereby improving the audio recording effect. For example, if a user records video with the camera during an interview, the recording needs to focus on collecting the voice of the person being interviewed.
- the operation of the user instructing to shoot may include, for example, an operation of clicking a shooting control, an operation of voice instruction, and other operation methods.
- the large circle 501 represents the maximum range that can be picked up by all the microphones of the mobile phone (which can also be described as the panoramic sound pickup range), and the small circle 502 represents the sound pickup range corresponding to the person concerned by the user (usually the person who is vocalizing).
- the sound pickup range of the person concerned by the user (i.e., the sound pickup range 1)
- the sound pickup range that needs to be enhanced for recording can be determined according to the position information of the image of the person concerned by the user in the recorded video picture. That is, the audio recording effect in the sound pickup range 1 shown in (b) in FIG. 5 is enhanced. In this way, the impact of other noises in the panoramic audio on the voice of the person concerned by the user in the recorded audio is reduced.
- the voice-producing face image identified by the mobile phone may be described as the first face image, and the voice-producing human mouth image may be described as the first human mouth image. They may also be described as a voiced face image and a voiced mouth image.
- the number of first face images is one or more
- the number of first human mouth images is one or more. It is understandable that, if some people are speaking in the currently captured video but the mobile phone fails to recognize that they are speaking, the face images or mouth images of those unrecognized speakers are not classified as the above-mentioned first face image or first human mouth image.
- after the mobile phone starts the voice enhancement mode and starts recording video images, it needs to recognize the first face image or the first human mouth image and, according to the first face image or the first human mouth image, determine the first sound pickup range in which the recording effect needs to be enhanced, so as to obtain a better recording effect.
- the mobile phone calls the microphone corresponding to the first sound pickup range to enhance the audio signal within the first sound pickup range.
- the cell phone includes one or more microphones for capturing the first audio (ie, the initial audio signal).
- if the first sound pickup range is included in the sound pickup range of a first microphone of the one or more microphones, the mobile phone performs at least one of the following to obtain the second audio (that is, the audio corresponding to the first face image or the first human mouth image): enhancing the audio signal within the first sound pickup range in the sound pickup range of the first microphone; attenuating the audio signal outside the first sound pickup range in the sound pickup range of the first microphone; and attenuating the audio signals of the microphones other than the first microphone among the one or more microphones.
- the mobile phone includes at least two microphones, and the at least two microphones are used to collect the first audio.
- the second microphone is turned off, and the audio collected by the other microphones of the at least two microphones except the second microphone is the second audio.
- with the second microphone turned off, the audio signals within the first sound pickup range in the sound pickup ranges of the other microphones of the at least two microphones are enhanced, and/or the audio signals outside the first sound pickup range in the sound pickup ranges of those other microphones are attenuated.
- the mobile phone is configured with a microphone 1 and a microphone 2 .
- if the first sound pickup range is within the sound pickup range of the microphone 1, then after using the microphone 1 and the microphone 2 to obtain the initial audio signal, the mobile phone can enhance the audio signal within the first sound pickup range collected by the microphone 1, attenuate the audio signal outside the first sound pickup range collected by the microphone 1, and attenuate the audio signal collected by the microphone 2, to obtain the audio corresponding to the first face image or the first human mouth image.
- alternatively, the mobile phone turns off the microphone 2, enhances the audio signal within the first sound pickup range in the audio signal collected by the microphone 1, and attenuates the audio signal outside the first sound pickup range, thereby obtaining the audio corresponding to the first face image or the first human mouth image.
- the mobile phone is configured with a microphone 1 and a microphone 2 .
- the first sound pickup range includes a sound pickup range 1 within the sound pickup range of the microphone 1 , and a sound pickup range 2 within the sound pickup range of the microphone 2 . That is to say, the first sound pickup range is the union of the sound pickup range 1 and the sound pickup range 2 .
- after the mobile phone uses the microphone 1 and the microphone 2 to obtain the initial audio signal, it can enhance the audio signals within the sound pickup range 1 of the microphone 1 and the sound pickup range 2 of the microphone 2 in the initial audio signal, and weaken the remaining audio signals in the initial audio signal, to obtain the audio corresponding to the first face image or the first human mouth image. It can be understood that the sound pickup range 1 and the sound pickup range 2 may overlap in whole or in part.
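The enhance-and-weaken step in the examples above can be sketched as a weighted mix. The sketch rests on a strong assumption that is not stated in the application: that the initial audio signal has already been separated into one sample sequence per sound pickup range. The function name `mix_with_gains` and the gain values are hypothetical.

```python
from typing import Dict, List, Set


def mix_with_gains(signals: Dict[str, List[float]], in_range: Set[str],
                   boost: float = 2.0, cut: float = 0.5) -> List[float]:
    """Scale each pickup range's samples (boost inside, cut outside) and sum them per sample."""
    if not signals:
        return []
    length = len(next(iter(signals.values())))
    if any(len(samples) != length for samples in signals.values()):
        raise ValueError("all signals must have the same number of samples")
    mixed = [0.0] * length
    for name, samples in signals.items():
        # Ranges inside the first sound pickup range are enhanced, the rest weakened.
        gain = boost if name in in_range else cut
        for i, sample in enumerate(samples):
            mixed[i] += gain * sample
    return mixed


if __name__ == "__main__":
    signals = {"range 1": [1.0, 1.0], "range 2": [1.0, -1.0], "other": [2.0, 0.0]}
    print(mix_with_gains(signals, {"range 1", "range 2"}))
```

Treating the first sound pickup range as a set of range names mirrors the case in which it is the union of the sound pickup range 1 and the sound pickup range 2.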
- the shooting interface 404 includes a viewfinder frame 48 for displaying a video image.
- the sound pickup range corresponding to the viewfinder frame 48 is the maximum sound pickup range of the currently recorded video picture.
- the mobile phone recognizes the first face image 47 , and assuming that the first face image is located in the center of the viewfinder frame 48 , the mobile phone determines that the first sound pickup range is the center of the maximum sound pickup range.
- the phone boosts the audio signal in the first pickup range.
- a prompt box 49 is displayed on the shooting interface 404 for prompting the user that the recording effect of the middle position has been enhanced.
- the prompt box 49 can be continuously displayed during the shooting process, the displayed content changes with the change of the first sound pickup range, and is automatically hidden after the shooting is stopped. Alternatively, it is only displayed within a preset time period, and automatically disappears after the preset time period, so as to avoid blocking the displayed video picture of the viewfinder frame 48 .
- the mobile phone can obtain the audio corresponding to the uttering face or the uttering mouth by enhancing the audio signal within the first sound pickup range, so as to enhance the sound recording effect of the uttering face or the uttering mouth and reduce the interference of external noise.
- the audio signal outside the first sound pickup range can also be weakened to obtain a better recording effect.
- only the audio signals outside the first pickup range are attenuated to reduce the interference of external noises.
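One way to picture how a face position in the viewfinder frame could correspond to a part of the maximum sound pickup range is a linear mapping from horizontal pixel position to angle. This is purely an illustrative assumption: the linear relation, the 120-degree span, and the function name `pickup_angles` are hypothetical and are not the method of the application.

```python
from typing import Tuple


def pickup_angles(face_left: float, face_right: float, frame_width: float,
                  span_deg: float = 120.0) -> Tuple[float, float]:
    """Map the horizontal extent of a face in the frame to an angular interval.

    Angles are measured from the camera axis; the frame is assumed to cover
    span_deg degrees in total, spread linearly across its width.
    """
    if frame_width <= 0 or not 0 <= face_left <= face_right <= frame_width:
        raise ValueError("face extent must lie within the frame")
    half = span_deg / 2.0
    start = face_left / frame_width * span_deg - half
    end = face_right / frame_width * span_deg - half
    return start, end


if __name__ == "__main__":
    # A face occupying the middle half of a 1920-pixel-wide frame.
    print(pickup_angles(480, 1440, 1920))
```

Under this assumption, a face located in the center of the viewfinder frame maps to an interval centered on the camera axis, which matches the example in which the first sound pickup range is the center of the maximum sound pickup range.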
- FIG. 6 is a schematic flowchart of an audio processing method provided by an embodiment of the present application.
- the following describes in detail the process, introduced above through (a)-(d) in FIG. 4, in which the mobile phone identifies the first face image or the first human mouth image, determines the first sound pickup range that needs voice enhancement, and obtains the audio corresponding to the first sound pickup range.
- the mobile phone recognizes the first face image or the first human mouth image.
- the mobile phone may recognize the first face image or the first human mouth image through a face image recognition algorithm. For example, in the process of recording a video image, the mobile phone determines through a face image recognition algorithm whether the captured video image contains a face image. If it does, the mobile phone identifies the face image and determines whether it is uttering according to changes, within a preset time period, in the facial data of the face image, such as facial feature data and facial contour data. The criterion for judging that the face image is uttering may be that the mobile phone judges that the face image is currently uttering.
- alternatively, after judging for the first time that the face image is uttering, the mobile phone determines that the face image is uttering only if it judges the face image to be uttering again within a preset time period.
- the human vocal organ is the human mouth.
- if the voiced human mouth data can be obtained, the data of the first human mouth image is preferentially determined, and the first sound pickup range is then determined based on the data of the first human mouth image.
- the mobile phone collects the face image 71, and recognizes the facial feature key points corresponding to the face image 71 through the face image recognition algorithm (such as the circular feature points displayed on the face image 71) to determine whether it is uttering, and can thereby obtain face data and/or mouth data.
- the facial feature points include upper lip feature points and lower lip feature points, from which the distance between the upper and lower lips can be obtained in real time. A distance threshold between the upper lip and the lower lip of the face image is preset.
- if the mobile phone detects that the number of times the distance between the upper lip and the lower lip of the face image exceeds the distance threshold is greater than the preset number of times, it determines that the current face image is uttering sound.
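The lip-distance criterion can be sketched as a simple counter. In this sketch each sample above the threshold counts as one occurrence, which is a simplifying assumption; the function name `is_uttering` and the sample values are hypothetical.

```python
from typing import Sequence


def is_uttering(lip_distances: Sequence[float], distance_threshold: float,
                preset_count: int) -> bool:
    """Return True if the lip distance exceeds the threshold more than preset_count times."""
    exceed_count = sum(1 for d in lip_distances if d > distance_threshold)
    return exceed_count > preset_count


if __name__ == "__main__":
    # Distances between upper and lower lip feature points, one per analyzed frame.
    print(is_uttering([1.0, 6.0, 2.0, 7.0, 8.0], distance_threshold=5.0, preset_count=2))
```

An implementation that counts only transitions from closed to open would be stricter; which reading is intended is not specified in the text.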
- the facial feature points may also include facial contour feature points
- the mobile phone can obtain data such as jaw changes and facial muscle changes according to the facial contour feature points, and then determine whether the face image is vocalizing. For example, if, within a preset time period, the number of times that the change data generated by the up and down movement of the chin exceeds the preset threshold is greater than the preset number of times, it is determined that the current face image is uttering sound.
- the mobile phone can also determine the voice-producing face or the voice-producing mouth according to changes in other data corresponding to the human mouth, such as the Adam's apple change data.
- the mobile phone can also combine the above-mentioned face data and mouth data to recognize the first face image or the first human mouth image more accurately.
- the number of the first face images is one or more. When there are multiple first face images, that is, when multiple face images make a sound simultaneously or successively within the first preset time period, the mobile phone can exclude some of them.
- a face image with a small area, or one at the edge of the video picture, is not considered to be the first face image.
- the face image that the user cares about should be one with a large area, or one displayed in or near the middle of the video picture.
- the first preset time period may be a pre-configured short time range.
- the mobile phone determines that user A is speaking, starts timing when user A stops speaking, and detects that user B starts speaking within the first preset time period. Further, within the first preset time period after user B stops speaking, it detects that user A starts speaking again. That is, if user B speaks immediately after user A during recording, or user A and user B speak alternately, the face images corresponding to both user A and user B can be confirmed as the first face image. This avoids frequently re-confirming the sound pickup range corresponding to the first face image within a short time, which reduces the amount of data processing and improves efficiency.
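The alternating-speaker case can be sketched as a grouping step over speech segments. The segment representation `(speaker, start, end)`, the function name `group_speakers`, and the numeric period are assumptions made for this illustration; the source only states that speakers who follow one another within the first preset time period are all treated as the first face image.

```python
from typing import List, Set, Tuple

Segment = Tuple[str, float, float]  # (speaker, start time, end time), hypothetical format


def group_speakers(segments: List[Segment],
                   first_preset_period: float) -> List[Set[str]]:
    """Group speakers whose turns follow each other within the preset period.

    segments must be sorted by start time. Each returned set is one group of
    speakers that would share a single first sound pickup range.
    """
    groups: List[Set[str]] = []
    current: Set[str] = set()
    last_end = None
    for speaker, start, end in segments:
        # A silence longer than the preset period closes the current group.
        if last_end is not None and start - last_end > first_preset_period:
            groups.append(current)
            current = set()
        current.add(speaker)
        last_end = end if last_end is None else max(last_end, end)
    if current:
        groups.append(current)
    return groups
```

For example, with a period of 0.5 seconds, user A and user B speaking in quick alternation fall into one group, while a user C who speaks several seconds later starts a new one.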
- the mobile phone identifies the sound-producing face image with the largest area, or the one closest to the center of the video picture, and confirms as the first face image that face image together with any sound-producing face image whose area differs from it by less than a preset threshold.
- that face image and any sound-producing face image within a preset range near it are confirmed as the first face image, so that the first sound pickup range is determined according to the first face image.
- the scenario in which the mobile phone determines the multiple first human mouth images is the same as the scenario in which the multiple first human face images are determined, and will not be described again.
- the center point of the video image includes, for example, the center point of the viewfinder frame, the center point of the display screen of the mobile phone, and the like.
- the mobile phone acquires the first feature value corresponding to the first face image or the first human mouth image.
- the mobile phone determines the first sound pickup range according to the first feature value.
- the first feature value describes the relative positional relationship between the mobile phone and the face of the real person corresponding to the first face image, or between the mobile phone and the mouth of the real person corresponding to the first face image. The mobile phone can therefore determine the first sound pickup range according to the first feature value. For example, if the real person corresponding to the first face image is located directly in front of the mobile phone, that is, the first face image is located in the center of the captured video picture, the first sound pickup range is the sound pickup range directly in front of the mobile phone.
- the first feature value includes one or more of a front/rear attribute parameter, an area ratio, and position information.
- the front/rear attribute parameter, the area ratio and the position information are parameters determined by the mobile phone according to the first face image or the first human mouth image; their meanings are described below.
- the following describes a specific method for the mobile phone to determine the first sound pickup range when the first feature value includes different parameters.
- the first feature value includes a front/rear attribute parameter of the first face image, or a front/rear attribute parameter corresponding to the first human mouth image.
- the "front/rear attribute parameter" indicates whether the video picture containing the first face image or the first human mouth image was captured by the front camera (for convenience, also referred to in this document as the front video picture) or by the rear camera (for convenience, also referred to in this document as the rear video picture).
- the front and rear attribute parameters can be used to determine whether the first sound pickup range is within a range of 180 degrees in front of the mobile phone or within a range of 180 degrees behind. Exemplarily, as shown in (b) of FIG.
- the sound pickup range corresponding to the front video picture includes the ranges represented by ellipse 204, ellipse 205 and ellipse 206, and the sound pickup range corresponding to the rear video picture may include the ranges represented by ellipse 201, ellipse 202 and ellipse 203.
- the video images displayed in the viewfinder of the mobile phone can be switched between the images captured by the front and rear cameras.
- as shown in the shooting interface 801 in (a) of FIG. 8, the mobile phone is in the voice enhancement mode and confirms that there is a sound-producing face image 81.
- the mobile phone confirms that the video picture containing the sound-producing face image 81 was captured by the front camera, that is, that the first feature value is the front attribute parameter. It then confirms that the first sound pickup range is within the 180-degree range in front of the mobile phone and displays a prompt box 82 informing the user that the front sound-recording effect has been enhanced.
- the shooting interface 801 further includes a front-rear switching control 83 for switching between the front and rear cameras.
- the mobile phone can switch the front camera to the rear camera in response to the user's operation of clicking the front and rear switch control 83 .
- the video picture displayed by the mobile phone switches from the picture captured by the front camera, as displayed on the shooting interface 801 shown in (a) of FIG. 8, to the picture captured by the rear camera.
- when the mobile phone recognizes the sound-producing face image 84 in the current video picture, it determines that the first feature value is the rear attribute parameter and that the first sound pickup range is within the 180-degree range behind the mobile phone.
- the mobile phone displays a prompt box 85 informing the user that the rear sound-recording effect has been enhanced.
- the sound pickup range corresponding to the rear video picture is the range represented by the ellipse 201, the ellipse 202 and the ellipse 203, and the sound pickup range corresponding to the front video picture is the range represented by the ellipse 204, the ellipse 205 and the ellipse 206. If the mobile phone confirms, according to the first feature value, that the first face image corresponds to the rear video picture, it confirms that the first sound pickup range is the range represented by the ellipse 201, the ellipse 202 and the ellipse 203.
- if the mobile phone confirms, according to the first feature value, that the first face image corresponds to the rear video picture, it confirms that the first sound pickup range is the sound pickup range corresponding to the microphone 27, the microphone 28 and the microphone 29.
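The front/rear attribute parameter acts as a coarse selector over the candidate pickup ranges. The sketch below is only an illustration: the enum `CameraSide`, the table `PICKUP_ELLIPSES` and the function name are hypothetical, and the integers are simply the reference numerals of the ellipses in the figures, not device identifiers.

```python
from enum import Enum
from typing import Dict, Tuple


class CameraSide(Enum):
    FRONT = "front"
    REAR = "rear"


# Reference numerals of the ellipses that represent each 180-degree half,
# as named in the description (used here purely as labels).
PICKUP_ELLIPSES: Dict[CameraSide, Tuple[int, ...]] = {
    CameraSide.FRONT: (204, 205, 206),
    CameraSide.REAR: (201, 202, 203),
}


def candidate_pickup_ranges(side: CameraSide) -> Tuple[int, ...]:
    """Return the ellipses that can make up the first sound pickup range."""
    return PICKUP_ELLIPSES[side]
```

The area ratio and position information described later would then narrow the selection within the returned half.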
- the first feature value includes the area ratio corresponding to the first face image, or the first feature value includes the area ratio corresponding to the first human mouth image.
- the "area ratio” is used to represent the ratio of the area of the first face image or the area of the first mouth image to the area of the video screen. This area ratio is used to measure the radius (or diameter) of the audio collected by the microphone.
- the person concerned by the user is usually placed at the center of the video image, that is, the first face image or the first human mouth image is located at the center of the viewfinder frame.
- the sound pickup ranges corresponding to different areas of the first face image or the first mouth image are different.
- the mobile phone determines two first face images in different time periods, which are the first face image 1 and the first face image 2 respectively. The areas of the two face images are different, and the area of the first face image 1 is larger than the area of the first face image 2 .
- the determined sound pickup range is the sound pickup range 1 .
- the determined sound pickup range is the sound pickup range 2 .
- Pickup range 1 is greater than pickup range 2.
- X is used to represent the first face image area or the first human mouth image area.
- Y is used to indicate the area of the video frame displayed by the viewfinder.
- N represents the sound pickup range corresponding to the viewing range.
- the area ratio is used to represent the ratio of the area of the first face image to the area of the video picture displayed by the viewfinder frame.
- the number of the first face images may be one or more, and then the area of the first face image is the area of one face image or the sum of the areas of multiple face images.
- the sum of the areas of the multiple face images can be represented by the area of the occupancy frame where the multiple face images are located, that is, the area of the smallest selection frame containing the multiple face images.
- the number of the first face images is 1. During face image recognition, the mobile phone determines the dotted frame 101 that frames the face area of the first face image 11 according to the position of the feature point at the top of the forehead, the position of the feature point at the bottom of the chin, and the positions of the outermost feature points on the left and right sides of the face excluding the ears, among the facial feature points of the face image 11. The image area within the frame is the area of the first face image. That is, when confirming the area of the first face image, only the face area is counted, excluding the influence of ears, hats, accessories, the neck and so on.
- the area of the video picture displayed in the viewfinder frame is the image area within the dotted frame 102. The mobile phone can then determine the area ratio as the ratio of the area of the dotted frame 101 to the area of the dotted frame 102. The method described here for determining the area of the first face image also applies below and is not repeated.
- as shown in the interface 1002 in (b) of FIG. 10, two face images are displayed in the interface 1002, and the mobile phone recognizes both as first face images that are making a sound.
- the area of the face image 12 on the right side is the image area within the frame selection range of the dotted frame 103
- the area of the face image 13 on the left side is the image area within the frame selection range of the dotted frame 104
- the area of the first face image is the image area within the dotted frame 105, that is, the area of the smallest frame containing all the face images (for example, the overall frame is determined from the extreme edge values of the selection frames of all the face images).
- the dotted frame 105 is used to represent the placeholder frame where the face image 12 and the face image 13 are located.
- the finally determined first face image area simultaneously includes the image areas corresponding to the two face images.
- the area of the video picture displayed in the viewfinder frame is the image area within the dotted frame 106. The mobile phone can then determine the area ratio as the ratio of the area of the dotted frame 105 to the area of the dotted frame 106.
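The area ratio for one or several face images can be sketched as below. The `Box` type, its field names and the image coordinate convention (top smaller than bottom) are assumptions for illustration; the smallest enclosing frame corresponds to the role of the dotted frame 105, and the viewfinder box to the dotted frame 106.

```python
from typing import Iterable, NamedTuple


class Box(NamedTuple):
    left: float
    top: float
    right: float
    bottom: float

    @property
    def area(self) -> float:
        return max(0.0, self.right - self.left) * max(0.0, self.bottom - self.top)


def enclosing_box(boxes: Iterable[Box]) -> Box:
    """Smallest axis-aligned frame containing every given face frame."""
    boxes = list(boxes)
    if not boxes:
        raise ValueError("at least one face frame is required")
    return Box(
        left=min(b.left for b in boxes),
        top=min(b.top for b in boxes),
        right=max(b.right for b in boxes),
        bottom=max(b.bottom for b in boxes),
    )


def area_ratio(face_boxes: Iterable[Box], viewfinder: Box) -> float:
    """Ratio of the (enclosing) first-face area to the viewfinder area."""
    return enclosing_box(face_boxes).area / viewfinder.area
```

With a single face frame the enclosing frame is the face frame itself, so the same function covers both the one-face and the multi-face case.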
- the mobile phone determines that the area of the right face image 14 is the largest.
- the mobile phone can exclude some voice-producing face images that users do not pay attention to through a preset threshold.
- the preset threshold is less than 20% of the maximum face image area.
- the mobile phone can exclude the left face image 15 that is smaller than 20% of the area of the right face image 14 .
- the first face image includes the face image 14 on the right.
- the preset threshold is that the distance from the face image with the largest area exceeds 35% of the length or width of the video picture displayed by the viewfinder.
- the mobile phone can exclude the left face image 15 whose distance from the right face image 14 exceeds 35% of the length of the video frame displayed in the viewfinder. Then, the first face image includes the right face image 14 .
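The two exclusion thresholds can be sketched as a filter over the sound-producing face images. The source gives the 20% area rule and the 35% distance rule as separate examples; applying both at once, measuring distance between face centers, and the names `Face` and `select_first_faces` are assumptions of this sketch.

```python
from dataclasses import dataclass
from math import hypot
from typing import List


@dataclass
class Face:
    label: str
    area: float
    center_x: float
    center_y: float


def select_first_faces(faces: List[Face],
                       frame_length: float,
                       min_area_fraction: float = 0.20,
                       max_distance_fraction: float = 0.35) -> List[Face]:
    """Keep the largest sound-producing face and the faces comparable to it."""
    if not faces:
        return []
    largest = max(faces, key=lambda f: f.area)
    kept = []
    for face in faces:
        # Rule 1: drop faces smaller than 20% of the largest face area.
        too_small = face.area < min_area_fraction * largest.area
        # Rule 2: drop faces farther from the largest face than 35% of the
        # length (or width) of the displayed video picture.
        distance = hypot(face.center_x - largest.center_x,
                         face.center_y - largest.center_y)
        too_far = distance > max_distance_fraction * frame_length
        if not (too_small or too_far):
            kept.append(face)
    return kept
```

The largest face always survives, since its distance to itself is zero and its area equals the maximum.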
- the area ratio is used to represent the ratio of the area of the first human mouth image to the area of the video picture displayed by the viewfinder frame.
- the number of the first human mouth images may be one or more, then the area of the first human mouth image is the area of one human mouth image or the sum of the areas corresponding to the multiple human mouth images.
- the area sum of the multiple human mouth images can be represented by the area of the occupancy frame where the multiple human mouth images are located, that is, by the area of the smallest box containing the multiple human mouth images.
- the number of the first human mouth images is 1. During face image recognition, the mobile phone determines the dotted frame 111 that frames the first human mouth image 16 according to the positions of the uppermost, lowermost, leftmost and rightmost feature points of the mouth image among the facial feature points; the image area within the frame is the area of the first human mouth image.
- the area of the video image displayed in the viewfinder frame is the image area within the frame selected by the dotted frame 112 .
- the mobile phone can determine the area ratio as the ratio of the area of the dotted frame 111 to the area of the dotted frame 112.
- the interface 1102 displays two human mouth images, both of which are recognized by the mobile phone as vocalized human mouth images.
- the area of the first mouth image 17 on the right is the image area within the frame selection range of the dotted frame 113
- the area of the first mouth image 18 on the left is the image area within the frame selection range of the dotted frame 114
- the mouth image area is the image area within the dotted frame 115, that is, the area of the smallest frame containing all the mouth images (for example, the overall frame is determined from the extreme edge values of the selection frames of all the mouth images).
- the dotted frame 115 is used to represent the placeholder frame where the first human mouth image 17 and the first human mouth image 18 are located.
- the finally determined first mouth image area simultaneously includes the image areas corresponding to the two human mouth images.
- the area of the video picture displayed by the viewfinder frame is the image area within the dotted frame 116. The mobile phone can then determine the area ratio as the ratio of the area of the dotted frame 115 to the area of the dotted frame 116.
- the mobile phone determines that the area of the right mouth image is the largest.
- the mobile phone can exclude some voice-producing mouth images that users do not pay attention to through a preset threshold.
- the preset threshold is less than 20% of the maximum mouth image area.
- the preset threshold is that the distance from the mouth image with the largest area exceeds 35% of the length or width of the video picture displayed by the viewfinder.
- the mouth image on the left is excluded, so the first mouth image includes only the mouth image on the right, and the radius of the first sound pickup range is determined according to the area of the first mouth image on the right.
- the sound pickup range determined by the mobile phone according to the first feature value of the first face image as shown in (a) in FIG. 10 may be the sound pickup range 2 shown in FIG. 9 .
- the sound pickup range determined by the mobile phone according to the first feature value of the first face image as shown in (b) in FIG. 10 may be the sound pickup range 1 shown in FIG. 9 .
- the area of a rectangle is used as the area of the corresponding first face image or first human mouth image. It can be understood that an irregular geometric figure can also be used to correspond to the first face image and the first human mouth image, so as to determine the corresponding area more accurately.
- the rectangle in the embodiments of the present application is only an exemplary illustration, and the embodiments of the present application do not specifically limit it.
- the area of the viewfinder frame is used as the area of the video screen. It can be understood that, in the case that the mobile phone is a full-screen mobile phone, the display area of the mobile phone can be used as the video screen area. Alternatively, other areas and areas of other shapes may also be used as the video screen area.
- the viewfinder frame area in this embodiment of the present application is only an exemplary description, which is not specifically limited in this embodiment of the present application.
- the first feature value includes position information corresponding to the first face image, or the first feature value includes position information corresponding to the first human mouth image.
- the "position information" indicates the position of the first face image or the first human mouth image in the video picture.
- the position information includes the offset of the center point of the first face image relative to the first reference point, such as the offset direction, and/or the offset angle, and/or the offset distance, and the like.
- the position information includes the offset of the center point of the first human mouth image relative to the first reference point.
- the first reference point is the center point of the video image or the focal point of focus.
- the offset direction means that the center point of the first face image or the first human mouth image is offset to the left, right, up, down, upper left, upper right, lower left or lower right relative to the first reference point.
- the offset angle is the angle of an offset toward the upper left, upper right, lower left or lower right.
- the offset distance is the distance of an offset to the left, right, up or down, or the distance of an offset at a certain offset angle.
- the coordinates of the center point of the first face image may be determined according to the extreme positions of its feature points in each direction. As described above for determining the area of the first face image, the coordinates of the center point of the first face image are determined from the position of the feature point at the top of the forehead, the position of the feature point at the bottom of the chin, and the positions of the feature points on the left and right sides of the face excluding the ears. Similarly, the center point coordinates of the first human mouth image are determined according to the positions of the uppermost, lowermost, leftmost and rightmost mouth feature points among the facial feature points of the face image.
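Determining a center point from extreme feature-point positions can be sketched as taking the midpoint of the bounding extremes. The function name and the `(x, y)` tuple representation are assumptions; the source does not specify how the extremes are combined, and the midpoint is the natural reading when a rectangle stands in for the face or mouth.

```python
from typing import Iterable, Tuple

Point = Tuple[float, float]


def center_from_feature_points(points: Iterable[Point]) -> Point:
    """Midpoint of the extreme x and y positions of the given feature points."""
    pts = list(points)
    if not pts:
        raise ValueError("at least one feature point is required")
    xs = [x for x, _ in pts]
    ys = [y for _, y in pts]
    return ((min(xs) + max(xs)) / 2, (min(ys) + max(ys)) / 2)
```

For a face, the input would be the forehead-top, chin-bottom and left/right edge points; for a mouth, the uppermost, lowermost, leftmost and rightmost lip points.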
- the preset first reference point may include, for example, the center point of the video image displayed by the viewfinder frame (which may also be described as the center point of the viewfinder), the focus within the viewfinder range, and the like.
- a coordinate system is constructed with the direction parallel to the bottom edge of the mobile phone (or the bottom edge of the current viewfinder frame) as the x-axis and the direction perpendicular to the x-axis as the y-axis.
- the coordinate system is parallel to the display screen of the mobile phone.
- the constructed coordinate system is used to define the offset direction, offset angle and offset distance of the center point of the first face image or the first human mouth image relative to the first reference point. Exemplarily, as shown in (a) of FIG.
- when the mobile phone displays in portrait orientation, the x-axis of the coordinate system is parallel to the bottom edge (that is, the short side) of the mobile phone.
- the x-axis is parallel to the side (ie, the long side) of the mobile phone.
- the intersection of the x-axis and the y-axis, that is, the origin coordinate is (0, 0)
- the positive direction of the x-axis is right
- the positive direction of the y-axis is up.
- the number of the first face images is 1, the center point of the first face image is the position corresponding to the mark 121, and the center point of the video picture displayed in the viewfinder frame is the position corresponding to the mark 122.
- the position of the center point of the viewfinder frame is determined according to the edge limit coordinates of the top, bottom, left, and right of the viewfinder frame.
- the mobile phone determines the position information of the first face image according to the positional relationship between the identification 121 and the identification 122 . For example, in the scene displayed on the interface 1201, the position information of the first face image is the lower left of the center point of the viewfinder frame. Or, as shown in the interface 1202 in (b) of FIG.
- the number of the first face images is 1, the center point of the first face image is the position corresponding to the mark 123, and the center point of the video picture displayed in the viewfinder frame is the position corresponding to the mark 124.
- the mobile phone determines the position information of the first face image according to the positional relationship between the identification 123 and the identification 124 . For example, in the scene displayed on the interface 1202, the position information of the first human mouth image is the lower left of the center point of the viewfinder frame.
- the center point of the first human face image is the center point within the image range composed of the multiple human face images.
- the center point of the first face image is the geometric center point of the frame selection range of the dotted frame 105 .
- the center point of the first human mouth image is the geometric center point of the range selected by the dotted frame 115 .
- the center point of the video picture displayed by the viewfinder frame is also the geometric center point of the viewfinder frame.
- the center point of the rectangle is used as the center point of the corresponding first face image or first human mouth image.
- irregular geometric figures can also be used to correspond to the first face image and the first human mouth image, so as to determine the corresponding center point more accurately.
- the rectangle in the embodiments of the present application is only an exemplary illustration, and the embodiments of the present application do not specifically limit it.
- the center point of the viewfinder frame is used as the first reference point, that is, the center point of the viewfinder frame is used to represent the center point of the video picture.
- the first reference point may also be represented in other forms.
- the center point of the entire screen of the display screen of the mobile phone is used to represent the center point of the video image, that is, as the first reference point.
- the user may not place the object of interest in the center of the viewing range during video recording, but select the object of interest by focusing.
- the mobile phone can obtain the user's intention and determine the object that the user pays attention to.
- the focus position for focusing may also be the focus position obtained by automatic focusing of the mobile phone. For example, the mobile phone automatically recognizes the portrait, and determines the corresponding focus position after auto-focusing.
- the number of first face images is 2, and the center point of the first face image is the position corresponding to the mark 125 .
- the mobile phone detects the user's operation of clicking on the screen, obtains the focused focal position, and displays a dotted frame 126 .
- the range framed by the dotted frame 126 is the focus range determined by the mobile phone according to the user's intention.
- the central focus within the focus range is the position corresponding to the marker 127 .
- the mobile phone determines the position information of the first face image according to the positional relationship between the identification 125 and the identification 127 . For example, the position information of the first face image is the upper left of the focus center.
- the mobile phone may determine the relative positional relationship between the first face image or the first human mouth image and the first reference point according to the coordinates of the center point of the first face image or the first human mouth image and the coordinates of the first reference point, and then determine the offset direction of the first face image or the first human mouth image in the video picture displayed in the viewfinder frame.
- for the coordinate system, refer to (a) or (b) of FIG. 13.
- the coordinates of the center point of the first face image or the first human mouth image are (X1, Y1)
- the coordinates of the first reference point are (X2, Y2)
- the first reference point is set as the origin of the coordinate system ( 0, 0).
- the relative positional relationship between the first face image or the first human mouth image and the first reference point is shown in Table 2 below.
- X1 < X2 means that the first face image or the first human mouth image is located on the left side of the first reference point, that is, the offset direction is to the left.
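The offset direction follows from comparing the center point (X1, Y1) with the first reference point (X2, Y2) in a coordinate system whose positive x direction is right and positive y direction is up. The sketch below is an assumed reconstruction of that comparison, since the table it belongs to is not reproduced in this text; the function name and the returned strings are hypothetical.

```python
def offset_direction(x1: float, y1: float,
                     x2: float = 0.0, y2: float = 0.0) -> str:
    """Describe where (x1, y1) lies relative to the reference point (x2, y2)."""
    horizontal = "left" if x1 < x2 else ("right" if x1 > x2 else "")
    vertical = "up" if y1 > y2 else ("down" if y1 < y2 else "")
    if horizontal and vertical:
        # Diagonal offsets combine both axes, e.g. "upper left".
        return ("upper " if vertical == "up" else "lower ") + horizontal
    return horizontal or vertical or "center"
```

The reference point defaults to the origin, matching the case where the first reference point is set as (0, 0).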
- the mobile phone may determine, according to the coordinates of the center point of the first face image or the first human mouth image and the coordinates of the first reference point, the offset angle of the first face image or the first human mouth image in the video picture displayed in the viewfinder frame (as shown in FIG. 14, the angle θ between the x-axis and the line connecting the center point (X1, Y1) of the first face image or the first human mouth image and the first reference point (X2, Y2)).
- the large circle 141 indicates the maximum sound pickup range corresponding to the viewfinder frame of the mobile phone, and the coordinates of the center point of the viewfinder frame are set to (0, 0), that is, the center point of the viewfinder frame is set as the first reference point.
- the maximum pickup range is divided into four quadrants, such as the first quadrant 142 , the second quadrant 143 , the third quadrant 144 and the fourth quadrant 145 .
- the mobile phone can determine the offset angle θ as the angle, within each quadrant, between the x-axis and the line connecting (X1, Y1) and (X2, Y2), so θ lies between 0° and 90°.
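Measuring the angle against the x-axis within whichever quadrant the point falls in amounts to taking the arctangent of the absolute coordinate differences. The sketch below assumes that reading; the function name is hypothetical, and whether the end values 0° and 90° are included is not stated in the text.

```python
from math import atan2, degrees


def offset_angle_degrees(x1: float, y1: float,
                         x2: float = 0.0, y2: float = 0.0) -> float:
    """Acute angle between the x-axis and the line from (x2, y2) to (x1, y1)."""
    dx = abs(x1 - x2)
    dy = abs(y1 - y2)
    if dx == 0 and dy == 0:
        raise ValueError("the center point coincides with the reference point")
    # Using absolute differences folds all four quadrants into [0, 90] degrees.
    return degrees(atan2(dy, dx))
```

Combined with the offset direction, this angle identifies the position of the center point without needing signed angles.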
- the mobile phone may determine, according to the coordinates of the center point of the first face image or the first human mouth image and the coordinates of the first reference point, the offset distance of the first face image or the first human mouth image in the video picture displayed in the viewfinder frame. According to the offset distance and the radius of the sound pickup range corresponding to the first face image, the mobile phone can determine whether the sound pickup range corresponding to the first face image exceeds the sound pickup range corresponding to the viewing range, and then determine the first sound pickup range.
- the large circle 151 is the maximum sound pickup range corresponding to the viewfinder frame, and the radius is R.
- the first reference point is the center point of the video screen displayed in the viewfinder, that is, the center point of the maximum sound pickup range, with coordinates (X2, Y2), and the coordinates of the center point of the first face image are (X1, Y1).
- P is the ratio of the first face image to the area of the video image displayed by the viewfinder frame, that is, the area ratio parameter.
- the radius of the first sound pickup range is equal to the distance between the center point of the first face image and the edge of the maximum sound pickup range. If r ⁇ 1.5S, the radius of the first pickup range is equal to the product of the radius of the panorama pickup range and the area ratio parameter. In this case, the mobile phone will not pick up sounds beyond the maximum pickup range. It can be understood that, in the case of r>S, determining the radius of the first sound pickup range by comparing r with 1.5S is only an exemplary illustration; other methods can also be used to determine the first sound pickup range, so as to ensure that the mobile phone can pick up the audio data corresponding to the first face image. For example, the radius of the first sound pickup range may be determined by comparing r with 2S.
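The geometry behind this check can be sketched as follows. The definitions of r and S are not reproduced in this text, so the sketch uses its own names and does not implement the 1.5S comparison: `offset_distance` is the straight-line distance between the center point (X1, Y1) and the first reference point (X2, Y2), `distance_to_edge` is how far the center point is from the boundary of the maximum pickup circle of radius R, and a candidate pickup radius of R multiplied by the area ratio P is taken from the product mentioned in the text. All function names are hypothetical.

```python
from math import hypot


def offset_distance(x1: float, y1: float,
                    x2: float = 0.0, y2: float = 0.0) -> float:
    """Straight-line distance between the center point and the reference point."""
    return hypot(x1 - x2, y1 - y2)


def distance_to_edge(offset: float, max_radius: float) -> float:
    """Distance from the center point to the edge of the maximum pickup circle."""
    return max_radius - offset


def exceeds_max_range(offset: float, pickup_radius: float,
                      max_radius: float) -> bool:
    """True if a pickup circle centered at the offset point leaves the maximum circle."""
    return offset + pickup_radius > max_radius


# Example with assumed numbers: R = 10, area ratio P = 0.4, center at (3, 4).
R, P = 10.0, 0.4
offset = offset_distance(3.0, 4.0)   # distance from the reference point
candidate_radius = R * P             # product of panorama radius and area ratio
needs_adjustment = exceeds_max_range(offset, candidate_radius, R)
```

When `needs_adjustment` is true, the text describes limiting the radius so that the mobile phone does not pick up sound beyond the maximum range; the exact rule for doing so is left to the description.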
- the first face image is converted into a rectangle, and the geometric center point of the rectangle is used as its center point.
- irregular geometric figures can also be used to correspond to the first face image and the first human mouth image, so as to determine the corresponding center point position more accurately. The rectangle in the embodiments of the present application is only an exemplary illustration, and the embodiments of the present application do not specifically limit it.
- the mobile phone may determine the first sound pickup range by using any one of the above-mentioned solutions 1 to 3. Alternatively, the mobile phone may determine the first sound pickup range after combining multiple solutions in the above-mentioned solutions 1 to 3. Alternatively, the mobile phone may determine the first sound pickup range by combining one or more parameters in the above solutions 1 to 3 with other parameters. Alternatively, the mobile phone may use other parameters to determine the first sound pickup range.
- the following introduces a method for confirming the first sound pickup range after the mobile phone combines the above-mentioned solutions 1 to 3.
- the mobile phone determines, according to the front and rear attribute parameters of the video picture corresponding to the first face image, that the corresponding video picture is the rear video picture.
- the first sound pickup range is then within the 180-degree range behind the mobile phone, that is, the range represented by the ellipse 161, the ellipse 162 and the ellipse 163.
- the mobile phone can further determine the first sound pickup range according to the position information corresponding to the first face image.
- the first face image is the face image on the left, and the center point 164 of the first face image is located at the upper left of the center point 165 of the viewfinder frame.
- the mobile phone determines that the offset direction is the upper left, and the center point of the first pickup range is located at the upper left of the center point of the rear pickup range.
- the first pickup range can be seen in (b) in Figure 16B.
- Ellipse 161 and ellipse 162 represent the left side of the range.
- the large circle 166 is the maximum sound pickup range corresponding to the rear video screen, and the corresponding left and right pickup range can be confirmed by dividing the sound pickup range left and right along the center dotted line.
- the first sound pickup range at the upper left of the rear can refer to the range represented by the left half ellipse 1611 and the left half ellipse 1621 shown in (c) of FIG. 16B .
- the position information also includes an offset angle and an offset distance. If the offset angle is greater than 45 degrees and the offset distance is greater than 1/2 of the radius of the video picture displayed in the viewfinder frame, the first face image is located above the center position of the video picture displayed in the viewfinder frame and is far away from the center position. As shown in (a) of FIG. 16C, the first face image is the left face image, and the offset distance between the center point 166 of the first face image and the center point 167 of the viewfinder frame is relatively large. In that case, the middle sound pickup range contributes little to the audio corresponding to the first face image, and the first sound pickup range can refer to the range represented by the ellipse 161 shown in (b) in FIG. 16C. Further, the first sound pickup range may be the range represented by the left half ellipse 1611 shown in (c) of FIG. 16B.
- the mobile phone is based on the front and rear attribute parameters of the video picture corresponding to the first face image, and the first face image.
- the mobile phone determines the sound pickup range according to the front and rear attribute parameters of the video picture corresponding to the first mouth image and the position information corresponding to the first mouth image.
- the mobile phone can determine the final first sound pickup range according to the area ratio corresponding to the first face image.
- the mobile phone can determine the radius of the first sound pickup range corresponding to the first face image through the area ratio and the sound pickup range corresponding to the viewing range.
- the circle 152 as shown in (a) of FIG. 15 delineates the first sound pickup range.
- the radius of the circle 152 may be used to correspond to the radius range representing the first sound pickup range.
- the first sound pickup range can be represented by the range represented by the left half ellipse 1611 shown in (c) of FIG. 16B .
- the radius of the first sound pickup range is finally determined as the distance between the center point of the first face image and the edge of the maximum sound pickup range.
- the first sound pickup range can be represented by the range represented by the left half ellipse 1611 and the left half ellipse 1612 shown in (c) of FIG. 16B .
- when the mobile phone determines the first sound pickup range by combining solutions 1 to 3 above, the order in which the parameters are determined is not restricted, and the mobile phone may determine the parameters in an order different from that in the above examples. For example, all parameters may be determined at the same time.
- the first sound pickup range corresponding to the first face image or the first mouth image can be determined, and then the first sound pickup range can be used to acquire audio subsequently, thereby improving the audio quality.
- the mobile phone acquires audio according to the first sound pickup range.
- the mobile phone may use a single microphone or multiple microphones to collect sound signals from various directions around, that is, to collect panoramic sound signals. After the mobile phone preprocesses the panoramic sound signals collected by the multiple microphones, initial audio data can be obtained, where the initial audio data includes sound information in various directions. Then, the mobile phone can record the audio corresponding to the first face image according to the initial audio data and the first sound pickup range.
- the mobile phone can enhance the sound within the first sound pickup range in the initial audio data and suppress (or attenuate) the sound outside the first sound pickup range, and then record the processed audio data to obtain the audio corresponding to the first face image or the first mouth image.
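The enhance-and-suppress step can be sketched as follows. This is a minimal illustration only, assuming the initial audio data has already been decomposed into one signal per direction (keyed by azimuth in degrees) and that the first sound pickup range is an azimuth interval. The function name, the per-direction representation, and the gain values are hypothetical and are not taken from the application.

```python
def apply_pickup_range(directional_audio, range_start_deg, range_end_deg,
                       enhance_gain=2.0, suppress_gain=0.25):
    """Mix per-direction signals, boosting those inside the pickup range.

    directional_audio: dict mapping azimuth in degrees -> list of samples
    (a hypothetical representation of the initial audio data).
    Assumes range_start_deg <= range_end_deg (no wrap-around handling).
    """
    length = len(next(iter(directional_audio.values())))
    mixed = [0.0] * length
    for azimuth, samples in directional_audio.items():
        inside = range_start_deg <= azimuth % 360 <= range_end_deg
        # Enhance sound inside the range, attenuate sound outside it.
        gain = enhance_gain if inside else suppress_gain
        for i, sample in enumerate(samples):
            mixed[i] += gain * sample
    return mixed


# One source inside the range (90 degrees), one outside (270 degrees).
audio = {90: [1.0, 1.0], 270: [1.0, 1.0]}
processed = apply_pickup_range(audio, 45, 135)
```

With these illustrative gains, the in-range source dominates the mix while the out-of-range source is attenuated rather than removed.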
- the audio corresponding to the first face image or the first mouth image records the sound within the first sound pickup range, and the first sound pickup range is a sound pickup range determined according to the first feature value corresponding to the first face image or the first mouth image, so the sound in the first sound pickup range is the sound corresponding to the speaking face or the speaking mouth that the user pays attention to. That is to say, the interference of noise in the recorded video picture with the sound made by the speaking face or the speaking mouth is reduced.
- directional voice enhancement can be performed in a complex shooting environment, and the audio algorithm only needs to enhance part of the audio signals, which can simplify the audio processing algorithm, improve processing efficiency, and reduce the requirements on the computing performance of the mobile phone hardware.
- the mobile phone can determine one or more reference first sound pickup ranges in the vicinity of the first sound pickup range.
- the mobile phone obtains one channel of audio according to the first sound pickup range, and obtains at least one channel of audio according to the reference first sound pickup range, and the mobile phone may also use panoramic audio as one channel of audio.
- the mobile phone can obtain the multi-channel audio corresponding to the first face image or the first human mouth image based on the first sound pickup range.
- one channel of audio can be understood as an audio file.
- the mobile phone may determine the corresponding one or more reference first sound pickup ranges according to the area ratio corresponding to the first face image or the first mouth image. Assume that the first sound pickup range and the reference first sound pickup range are both determined according to the area ratio information. For example, based on Table 1, the mobile phone can determine the first sound pickup range and the reference first sound pickup range according to the rules in Table 4 below. In Table 4 below, the first sound pickup range is a recommended value, and the reference first sound pickup range includes enhancement value 1, enhancement value 2, and enhancement value 3.
- the mobile phone may determine the audio corresponding to the first sound pickup range and the reference first sound pickup range according to different audio processing methods. For example, based on the above process of determining the first sound pickup range, the audio corresponding to the first sound pickup range is the audio determined by the Dolby sound effect algorithm, and the audio corresponding to the reference first sound pickup range is the audio determined according to the Histen sound effect algorithm. As shown in Table 5 below, Algorithm 1-Algorithm 4 are different audio algorithms, and the audio corresponding to the first sound pickup range and the reference first sound pickup range is determined according to different audio algorithms.
- the first sound pickup range is a recommended value
- the reference first sound pickup range includes enhancement value 1, enhancement value 2, and enhancement value 3.
- the mobile phone can obtain the audio corresponding to the first sound pickup range and the reference first sound pickup range by combining the area ratio information corresponding to the first face image or the first mouth image with the audio algorithm.
- the first sound pickup range is a recommended value
- the reference first sound pickup range includes enhancement value 1, enhancement value 2, and enhancement value 3.
- the mobile phone may also use other methods to determine the reference first sound pickup range, which is not specifically limited in this embodiment of the present application.
- the mobile phone can process the initial audio data to enhance the sound within the reference first sound pickup range, suppress the sound outside the reference first sound pickup range, and then record the processed audio data to obtain one or more channels of audio corresponding to the first face image or the first mouth image.
- according to the first sound pickup range and the reference first sound pickup range, the mobile phone can record multiple channels of audio that match the first face image or the first mouth image and the corresponding first feature value, for the user to choose from during later playback.
- each channel of audio data corresponding to the first face image or the first mouth image may be saved as one audio file, and the first face image may correspond to multiple audio files.
- the multiple channels of audio provide the user with audio in different sound pickup ranges.
- the more channels of audio there are, the more likely it is that one of them matches the sound corresponding to the first face image or the first mouth image that the user is concerned about, and the more choice the user has during audio playback.
- the mobile phone may also record the audio corresponding to the first face image or the first mouth image according to whichever of the first sound pickup range and the reference first sound pickup range the user selects.
- if the mobile phone detects that the user clicks the recommended value selection control 171, then in the process of recording the video picture, the audio corresponding to the first face image or the first mouth image is recorded according to the first sound pickup range and the initial audio data.
- if the mobile phone detects that the user clicks the enhancement value 1 selection control, then in the process of recording the video picture, the audio corresponding to the first face image or the first mouth image is recorded according to the initial audio data and the reference first sound pickup range corresponding to enhancement value 1.
- if the mobile phone detects that the user clicks the no-processing selection control 172, then in the process of recording the video picture, the audio signals in all directions are fused according to the initial audio data to obtain panoramic audio. That is, the audio corresponding to the no-processing selection control 172 is panoramic audio, which can also be understood as the audio obtained when the mobile phone is in the non-voice-enhancement mode.
- the methods for determining the recommended value, the enhancement value 1, the enhancement value 2 and the enhancement value 3 in the interface 1701 can be referred to as shown in Tables 4 to 6 above, and will not be repeated here.
- the user may experience the recording effects corresponding to different sound pickup ranges before formally recording the video picture, and then determine the sound pickup range to be selected in the final video picture recording process.
- the mobile phone can save only the corresponding audio files according to the user's choice. While ensuring that the needs of users are met, the storage space of the mobile phone can be saved.
- the first sound pickup range may be changed to the second sound pickup range during the process of recording video images by the mobile phone.
- while recording a video picture, the mobile phone detects an operation by which the user instructs it to switch between the front and rear cameras.
- the sound pickup range before switching is the first sound pickup range
- the sound pickup range after the switch is the second sound pickup range.
- the audio recorded by the mobile phone at least includes the audio of the first duration and the audio of the second duration.
- the audio of the first duration is the audio corresponding to the first sound pickup range
- the audio of the second duration is the audio corresponding to the second sound pickup range.
- the mobile phone can dynamically determine the sound pickup range based on the change of the voice-emitting face or the voice-emitting mouth in the video screen, and then record the audio according to the pickup range.
- the audio of the formed video picture may include multiple audios of different durations or the same duration recorded based on the changed sound pickup range according to the time sequence.
- the mobile phone can always focus on improving the audio recording quality of the part that needs to be enhanced according to the change of the pickup range, thereby ensuring the audio recording effect.
- when the user plays the video file, the user can be presented with a dynamically changing playback experience, such as a sound range that matches the changes in the video content.
- the first feature value corresponding to the first face image or the first human mouth image changes, resulting in a change in the sound pickup range.
- the front and rear attribute parameters of the video picture change, resulting in the change of the first sound pickup range.
- the interface 1801 displays the front video image.
- the mobile phone detects that the user clicks the front and rear switch control 181, switches to the rear camera to shoot, and displays the interface 1802 shown in (b) of FIG. 18 .
- the first feature value corresponding to the first face image or the first mouth image changes; in the recorded audio, the audio within the duration 00:00-00:15 is the audio corresponding to the first sound pickup range, and the audio after 00:15 is the audio corresponding to the second sound pickup range.
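How a recording splits into segments when the sound pickup range changes can be sketched as below. This is an illustrative sketch only: the function name, the list-of-tuples representation, and the total duration of 40 seconds used in the example are assumptions. Only the 00:15 switch point comes from the text.

```python
def build_audio_segments(total_seconds, range_changes):
    """Split a recording into (start, end, pickup_range) segments.

    range_changes: list of (time_in_seconds, pickup_range_label) tuples,
    sorted by time, with the first entry at time 0 (hypothetical format).
    """
    segments = []
    for index, (start, label) in enumerate(range_changes):
        if index + 1 < len(range_changes):
            end = range_changes[index + 1][0]
        else:
            end = total_seconds
        if end > start:  # skip empty segments
            segments.append((start, end, label))
    return segments


# Front-to-rear camera switch at 00:15 in a recording assumed to last 40 s.
segments = build_audio_segments(40, [(0, "first"), (15, "second")])
```

Each segment carries the label of the sound pickup range that was active while it was recorded, so the audio can follow the video content in time order.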
- the position information corresponding to the first face image or the first human mouth image changes, resulting in a change in the first sound pickup range.
- the picture range and picture size of the video picture in the viewfinder frame will change with the change of the zoom factor (ie, the Zoom value).
- the zoom factor may be a preset zoom factor, the last zoom factor used before the camera was turned off, or a zoom factor pre-indicated by the user, and the like.
- the zoom factor corresponding to the viewfinder frame can also be changed according to the user's instruction. In such a scene, as the zoom factor changes, the viewing range changes. Correspondingly, the area of the first face image or the area of the first mouth image changes, and so does the area ratio corresponding to the first face image or the first mouth image. That is to say, a change of the zoom factor leads to a change of the sound pickup range. In this way, in the subsequent video playback process, the recorded audio can change dynamically with the change of the display area of the video content, so as to improve the user's playback experience.
- the mobile phone can determine the sound pickup range corresponding to the viewing range and the sound pickup range corresponding to the area ratio of the first face image or the area ratio of the first mouth image according to the zoom factor.
- In Table 7, X represents the area of the first face image or the area of the first mouth image.
- Y represents the area of the video picture displayed in the viewfinder frame.
- the change of the zoom factor does not need to change the sound pickup range.
- the first face image does not change, indicating that the content of the user's attention has not changed.
- user A interviews user B, and uses a mobile phone to photograph the interview process of user B.
- the mobile phone determines that the first face image in the video picture is the face image of user B.
- the mobile phone detects that the zoom factor has increased, but at this time, the first face image in the video screen is still the face image of user B. Then, the mobile phone does not need to acquire the first sound pickup range again, so as to reduce the amount of computation and save power consumption.
- the mobile phone detects the operation of changing the zoom factor multiple times, so it is not necessary to change the sound pickup range.
- the preset time period is 2s. After the mobile phone detects an operation of changing the zoom factor for the first time, it does not recalculate the sound pickup range immediately. If the mobile phone does not detect another operation of changing the zoom factor within 2s, it recalculates the sound pickup range. If the mobile phone detects an operation of changing the zoom factor again within 2s, it still does not recalculate the sound pickup range; instead, it takes the time at which this operation was detected as a new starting point and monitors whether another operation of changing the zoom factor is detected within the next 2s.
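The waiting behavior described for zoom changes resembles a debounce timer, and one possible reading is sketched below. The class and method names are hypothetical, and timestamps are passed in explicitly (in seconds) so the sketch stays deterministic. Only the 2s preset time period comes from the text.

```python
class ZoomDebouncer:
    """Defer pickup-range recalculation until zoom changes have stopped."""

    def __init__(self, quiet_period=2.0):
        self.quiet_period = quiet_period  # preset time period, in seconds
        self.last_change = None           # time of the latest zoom change

    def on_zoom_change(self, now):
        # Every zoom operation restarts the waiting period; nothing is
        # recalculated at this point.
        self.last_change = now

    def should_recalculate(self, now):
        """Return True once, after quiet_period with no further zoom change."""
        if self.last_change is None:
            return False
        if now - self.last_change >= self.quiet_period:
            self.last_change = None
            return True
        return False


debouncer = ZoomDebouncer()
debouncer.on_zoom_change(0.0)
early = debouncer.should_recalculate(1.0)    # still inside the 2 s window
debouncer.on_zoom_change(1.5)                # second zoom restarts the window
restarted = debouncer.should_recalculate(3.0)
settled = debouncer.should_recalculate(3.5)  # 2 s after the last change
```

Restarting the window on every zoom operation means a continuous pinch gesture triggers at most one recalculation, which is consistent with the stated aim of reducing computation.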
- the first sound pickup range changes.
- the above-mentioned switching scene between the front and rear cameras can also be understood as a change in the first face image and the first mouth image.
- the change of the voiced face image or the human mouth image causes the first human face image or the first human mouth image to change.
- the mobile phone confirms that the first face image is the two face images included in the video screen.
- the mobile phone identifies the first face image as the face image 182 on the right side of the video screen. Or, if the shooting picture moves, and the currently recorded video picture does not contain the previously recognized first face image or the first mouth image, the above method needs to be used to re-identify the first sound pickup range.
- the second sound pickup range is determined.
- the mobile phone uses the first sound pickup range corresponding to the recommended value to record the video before 00:30, and at 00:30 detects that the user clicks the enhancement value 2 selection control 183.
- the mobile phone determines that the second sound pickup range is the sound pickup range corresponding to enhancement value 2, displays the interface 1804 as shown in (d) in FIG., and then obtains audio according to the sound pickup range corresponding to enhancement value 2.
- before generating the audio file for each channel of audio, the mobile phone can perform multiple kinds of sound effect processing on each channel of audio, so that the recorded audio obtains higher audio quality and a better audio processing effect.
- the sound effect processing may include: Dolby sound effects, Histen sound effects, sound retrieval system (SRS) sound effects, bass enhanced engine (BBE) sound effects, or dynamic bass enhanced engine (DBEE) sound effects, etc.
- to prevent shaking of the mobile phone from causing frequent changes of the first feature value, and in turn frequent changes of the first sound pickup range, the mobile phone can set a preset time threshold and not change the first sound pickup range within the preset time threshold. For example, if the threshold is set to 1s and the first feature value changes twice in a row within 1s, the mobile phone considers that the current change in the first feature value is caused by shaking of the mobile phone, and the corresponding first sound pickup range is not changed.
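One possible reading of this shake filter is sketched below: a feature-value change that arrives within the preset time threshold of the previous change is treated as shake and ignored. The class name, the method name, and the choice to accept the first change in a burst are assumptions made for illustration. Only the 1s threshold comes from the text.

```python
class ShakeFilter:
    """Ignore feature-value changes that follow each other too quickly."""

    def __init__(self, threshold=1.0):
        self.threshold = threshold    # preset time threshold, in seconds
        self.previous_change = None   # time of the previous feature change

    def accept_change(self, now):
        """Return True if the pickup range should follow this change."""
        is_shake = (self.previous_change is not None
                    and now - self.previous_change < self.threshold)
        self.previous_change = now
        return not is_shake


shake_filter = ShakeFilter()
first = shake_filter.accept_change(0.0)   # isolated change: accepted
second = shake_filter.accept_change(0.4)  # second change within 1 s: ignored
later = shake_filter.accept_change(2.0)   # well separated: accepted again
```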
- the mobile phone may process the audio signal based on the first sound pickup range while collecting the audio signal, so as to obtain the audio corresponding to the first face image or the first mouth image.
- the mobile phone may first collect the audio signal, and after the video recording is completed, process the audio signal according to the first sound pickup range to obtain the audio corresponding to the first face image or the first human mouth image.
- the mobile phone calls the corresponding microphone to collect the audio signal within the first sound pickup range, and obtains the audio corresponding to the first face image or the first mouth image after processing.
- the recording function may include a single-channel recording function and a multi-channel recording function.
- the single-channel recording function refers to displaying a viewfinder frame during the shooting process of the mobile phone, which is used for recording a video image of one channel.
- the multi-channel recording function means that the mobile phone displays at least two viewfinder frames during the shooting process, and each viewfinder frame is used for one video frame.
- each channel of video images and the corresponding audio collection method can refer to the implementation method of the single-channel recording function.
- the shooting interface includes a viewfinder as an example for description.
- the process corresponding to the multi-channel video recording function including two or more viewfinder frames is similar to this, and will not be described repeatedly.
- the mobile phone determines the first sound pickup range according to the voice-emitting face image or the voice-emitting mouth image, and then records audio according to the first voice pickup range. Subsequently, the recorded audio needs to be saved, and the user can play the video image and audio of the saved video.
- the scene of recording the video screen is a real-time communication scene such as live broadcast, video call, etc.
- the method of recording audio during the process of recording the video picture can refer to the above method, but after an operation by which the user instructs to stop shooting, that is, an operation of stopping the communication, is detected, the communication is stopped directly without generating a recorded video. It is understandable that, in some real-time communication scenarios, the user may also choose to save the recorded video.
- the mobile phone determines whether to save the recorded video in the real-time communication scene.
- the mobile phone stops recording video images and audio, and generates a video recording.
- the operation of the user instructing to stop shooting may be the operation of the user clicking on the displayed control 45 in the video preview interface 403 shown in (c) in FIG.
- the embodiments of the present application do not make specific limitations.
- after detecting an operation by which the user instructs to stop shooting, the mobile phone generates a recorded video and returns to the video recording preview interface or the shooting preview interface.
- the recorded video may include video images and audio.
- for the thumbnail of the recorded video generated by the mobile phone, refer to the thumbnail 191 displayed in the interface 1901 shown in (a) of FIG. 19, or the thumbnail 192 displayed in the interface 1902 shown in (b) of FIG. 19.
- the mobile phone may prompt the user that the recorded video has multiple audio channels.
- the video thumbnail or the detailed information of the recorded video may include prompt information indicating the multiple audio channels; for example, the prompt information may be the multiple-speaker mark 193 displayed on the interface 1902 shown in (b) of FIG. 19, another form of mark, or text information.
- each channel of audio may respectively correspond to the audio correspondingly collected in the first sound pickup range and the reference first sound pickup range.
- in response to the operation by which the user instructs to stop shooting, the mobile phone displays an interface 1903 as shown in (c) in FIG. 19, which prompts the user to select the audio to be saved for the video file.
- the video file currently contains audios 194-197, which respectively correspond to audio files recorded in different pickup ranges, or correspond to audio files processed by different audio algorithms in the same pickup range.
- audios 194-197 correspond to audios with recommended value, enhancement value 1, enhancement value 2, and enhancement value 3, respectively.
- the mobile phone can play the video file and the corresponding audio.
- if the mobile phone detects that the user has instructed it to play the audio 194, it plays the video file and the audio 194. After watching the video file, the user can select the audio with the better audio effect and save it. In response to the user's selection, the audio that the user needs to save is determined, which improves the user experience and avoids the excessive storage space occupation caused by saving too much audio.
- for the current video file, the user selects to save the audio 194 and the audio 197.
- the mobile phone completes the saving of the video file, and displays the interface 1902 as shown in (b) of FIG. 19 .
- the number of speakers in the speaker mark 193 may correspond to the number of audios contained in the current video file.
- the mobile phone plays the video image and audio of the recorded video.
- the operation of the user instructing to play the recorded video may be an operation of the user clicking the thumbnail 191 in the recording preview interface shown in (a) of FIG. 19 .
- the operation of the user instructing to play the recorded video may be the operation of the user clicking the thumbnail 192 in the gallery shown in (b) of FIG. 19 .
- the mobile phone plays the recorded video according to the video picture and audio recorded in the above-mentioned recording process.
- the mobile phone can display a video playback interface, and the video playback interface can include recorded video images.
- the mobile phone can play the audio corresponding to the first sound pickup range by default, and then switch to play other audio according to the user's instruction.
- the user has selected a specific sound pickup range, and the mobile phone automatically plays the audio corresponding to the sound pickup range selected by the user.
- the video playback interface may include multiple audio switching controls, and each audio switching control corresponds to a channel of audio. After the mobile phone detects that the user clicks an operation of an audio switching control, the audio of the channel corresponding to the audio switching control is played.
- the mobile phone may display a video playback interface 2001 as shown in (a) of FIG. 20, and the video playback interface 2001 displays a video picture. Audio switching controls 201-205 are also displayed on the video playback interface 2001. As shown in (a) of FIG. 20, if the audio switching control 201 is currently selected on the mobile phone, or the recommended value is selected by default, the audio corresponding to the first sound pickup range is played. If the mobile phone detects that the user clicks the audio switching control 203, the audio corresponding to the reference first sound pickup range associated with the audio switching control 203 can be played.
- the mobile phone may delete part of the audio corresponding to the video file in response to the user's operation.
- the mobile phone detects that the user has long pressed the audio switching control 205, and displays a deletion prompt box. If the user confirms the deletion, the audio corresponding to the audio switching control 205 is deleted, and the interface 2003 shown in (c) of FIG. 20 is displayed. In the interface 2003, the audio control 205 corresponding to the audio whose deletion has been confirmed by the user is no longer displayed. In this way, during the video playback process, the audio that the user does not want to save can be deleted according to the user's requirements, thereby improving the user experience.
- the mobile phone may display a video playback interface without playing audio first. After detecting the user's instruction operation, the mobile phone plays the audio indicated by the user.
- the mobile phone can play the audio corresponding to the first face image or the first mouth image, so that the played audio reduces the interference of noise with the sound made by the speaking face or the speaking mouth, and the played audio matches the face image that the user is concerned about in real time, improving the user's audio experience.
- the mobile phone can switch and play the audio corresponding to different sound pickup ranges, providing the user with a variety of audio playback options, realizing the adjustability of the audio, and improving the user's audio playback experience.
- the mobile phone can play the audio corresponding to the first face image or the first mouth image and to the first feature value as they change in real time, so that the audio matches the changing video picture in real time, and the user's audio experience is improved.
- FIG. 21 is a schematic flowchart of another audio processing method provided by an embodiment of the present application.
- the audio processing method can be applied to the electronic device 100 shown in FIG. 1 .
- after the electronic device detects an operation by which the user instructs it to turn on the camera, the electronic device starts the camera and displays a shooting preview interface. After that, after detecting an operation by which the user instructs it to shoot, the electronic device starts to collect the video picture and the first audio (i.e., the initial audio signal).
- the image captured by the camera of the electronic device is the initial video image, and after the initial video image is processed, a video image that can be displayed on the display screen is obtained.
- the step of processing the initial video image is performed by the processor.
- the video frame captured by the camera is only an exemplary illustration.
- the electronic device starts the voice enhancement mode in response to the user's operation, before or after detecting the operation by which the user instructs it to shoot. Alternatively, the electronic device starts the voice enhancement mode automatically after detecting the operation by which the user instructs it to shoot.
- the first audio is audio signals in various directions collected by one or more microphones of the electronic device. Subsequently, the voice-enhanced audio may be obtained based on the first audio.
- the processor includes a GPU, an NPU, and an AP as an example for illustration. It can be understood that, the steps performed by the GPU, NPU, and AP here may also be performed by other processing units in the processor, which are not limited in this embodiment of the present application.
- the NPU in the processor uses image recognition technology to recognize whether the video picture contains a face image and/or a human mouth image. Further, the NPU can also confirm the voice-producing face or the voice-producing mouth according to the data of the face image and/or the mouth image, so as to confirm the sound pickup range that needs to perform directional recording.
- the target image can be used to determine the first feature value of the target image, and then the first sound pickup range can be determined according to the first feature value.
- the first feature value includes one or more of the front and rear attribute parameters, the area ratio, and the position information.
- the front and rear attribute parameters are used to indicate whether the video screen is a video screen shot by the front camera or a video screen shot by the rear camera;
- the area ratio is used to indicate the ratio of the area of the target image to the area of the video screen; location information , which is used to indicate the position of the target image in the video frame.
- the first feature value includes pre- and post-position attribute parameters corresponding to the target image. That is to say, the AP in the processor determines whether the video picture where the current target image is located is the front video picture or the rear video picture. If it is a front video image, the first sound pickup range is the sound pickup range on the front camera side. If it is a rear video image, the first sound pickup range is the sound pickup range on the rear camera side.
- the first feature value includes the area ratio corresponding to the target image.
- the "area ratio" indicates the ratio of the area of the first face image or the area of the first human mouth image to the area of the video picture (for example, expressed as X/Y).
- the electronic device determines the first feature value according to the ratio of the area of the first face image to the area of the viewfinder.
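The area ratio X/Y described above can be sketched as follows. This is a minimal illustration only: the `Box` type, the `area_ratio` function, and the pixel dimensions are hypothetical names and values chosen for the example and do not come from the application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in pixels (hypothetical representation)."""
    width: int
    height: int

    @property
    def area(self) -> int:
        return self.width * self.height


def area_ratio(target: Box, frame: Box) -> float:
    """Return X/Y: area of the target image divided by area of the video picture."""
    if frame.area <= 0:
        raise ValueError("frame area must be positive")
    return target.area / frame.area


# Illustrative values: a face box inside a 1920x1080 viewfinder.
face = Box(width=480, height=540)
viewfinder = Box(width=1920, height=1080)
ratio = area_ratio(face, viewfinder)
print(ratio)
```

A larger ratio means the target image fills more of the viewfinder, which the description uses as one input when sizing the first sound pickup range.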
- the first feature value includes position information corresponding to the target image.
- the AP determines the position of the first sound pickup range within the sound pickup range of the first audio according to the position information of the target image in the video picture. Specifically, the AP determines a first offset of the center point of the target image relative to a first reference point, where the first reference point is the center point of the video picture or the focus point. The AP then determines a second offset of the center point of the first sound pickup range relative to the center point of the sound pickup range of the first audio, where the second offset is proportional to the first offset, and thereby obtains the first sound pickup range.
- the first offset amount or the second offset amount includes an offset angle and/or an offset distance.
- the offset of the center of the target image relative to the reference point includes an offset angle θ1 and an offset distance L1.
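The two offsets described above can be sketched as follows. The function names, the choice of `atan2` for the offset angle, and the proportionality constant `k` are assumptions made for illustration; the application only states that the second offset is proportional to the first.

```python
import math
from typing import Tuple

Point = Tuple[float, float]
Offset = Tuple[float, float]  # (offset angle in degrees, offset distance)


def first_offset(target_center: Point, reference_point: Point) -> Offset:
    """Offset of the target image center relative to the first reference point."""
    dx = target_center[0] - reference_point[0]
    dy = target_center[1] - reference_point[1]
    angle = math.degrees(math.atan2(dy, dx))
    distance = math.hypot(dx, dy)
    return angle, distance


def second_offset(first: Offset, k: float) -> Offset:
    """Offset of the pickup-range center, proportional to the first offset.

    Assumption: the direction (angle) is kept and the distance is scaled by k.
    """
    angle, distance = first
    return angle, k * distance


# Illustrative values: reference point at the center of a 1920x1080 picture.
theta1, l1 = first_offset(target_center=(1260.0, 940.0), reference_point=(960.0, 540.0))
theta2, l2 = second_offset((theta1, l1), k=0.01)
print(theta1, l1, theta2, l2)
```

How `k` would be chosen (for example from the field of view and the width of the pickup range of the first audio) is not specified in this section and is left open here.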
- the AP can determine the first sound pickup range by using any one or any combination of the front/rear attribute parameter, the area ratio, and the position information.
- the AP in the processor uses the first audio collected by the one or more microphones to enhance the audio signals within the first sound pickup range and/or attenuate the audio signals outside the first sound pickup range, thereby obtaining the audio corresponding to the first face image or the first human mouth image, that is, the second audio.
- the AP may call the microphone corresponding to the first sound pickup range to enhance the audio signal within the first sound pickup range, so that the volume within the first sound pickup range is greater than the volume outside the first sound pickup range.
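The enhance/attenuate step can be illustrated with a deliberately simplified sketch. It assumes the first audio has already been separated into per-direction components of equal length, and it applies a fixed gain to each component; the data layout, the function name, and the gain values are hypothetical, and this is not the beamforming processing a real device would use.

```python
from typing import Dict, List, Tuple


def apply_pickup_range(
    beams: Dict[float, List[float]],
    pickup_range: Tuple[float, float],
    gain_in: float = 2.0,
    gain_out: float = 0.5,
) -> List[float]:
    """Mix per-direction components, boosting those inside the pickup range.

    beams maps a direction in degrees to its samples (all lists equal length).
    """
    if not beams:
        return []
    low, high = pickup_range
    length = len(next(iter(beams.values())))
    mixed = [0.0] * length
    for direction, samples in beams.items():
        # Enhance components inside the range, attenuate the others.
        gain = gain_in if low <= direction <= high else gain_out
        for i, sample in enumerate(samples):
            mixed[i] += gain * sample
    return mixed


# Illustrative first audio: three directions, two samples each.
first_audio = {-60.0: [0.1, 0.1], 0.0: [0.2, -0.2], 60.0: [0.1, 0.1]}
second_audio = apply_pickup_range(first_audio, pickup_range=(-30.0, 30.0))
print(second_audio)
```

With `gain_in > gain_out`, the component from inside the first sound pickup range dominates the mix, which matches the stated goal that the volume inside the range be greater than the volume outside it.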
- the electronic device includes one or more microphones, and the one or more microphones are used to collect the first audio.
- the sound pickup range of the first microphone among the one or more microphones includes part or all of the first sound pickup range.
- the electronic device includes at least two microphones, and the at least two microphones are used to collect the first audio.
- the second microphone is turned off, and the audio collected by the microphones other than the second microphone among the at least two microphones is the audio corresponding to the first face image or the first human mouth image.
- when the second microphone is turned off, the audio signals within the first sound pickup range are enhanced in the sound pickup ranges of the microphones other than the second microphone among the at least two microphones, and/or the audio signals outside the first sound pickup range are attenuated in the sound pickup ranges of those other microphones.
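The microphone selection described above can be sketched as an interval-overlap check. The microphone names, the angular ranges, and the representation of a pickup range as a `(low, high)` pair in degrees are assumptions for illustration only.

```python
from typing import Dict, List, Tuple

Range = Tuple[float, float]  # (low, high) in degrees, hypothetical representation


def overlaps(a: Range, b: Range) -> bool:
    """True if range a includes part or all of range b."""
    return a[0] <= b[1] and b[0] <= a[1]


def select_microphones(
    mic_ranges: Dict[str, Range], first_range: Range
) -> Tuple[List[str], List[str]]:
    """Split microphones into those kept on and those turned off.

    A microphone whose pickup range does not include any part of the
    first sound pickup range is turned off.
    """
    active = [name for name, r in mic_ranges.items() if overlaps(r, first_range)]
    turned_off = [name for name in mic_ranges if name not in active]
    return active, turned_off


# Illustrative layout with three microphones.
mics = {"top": (-90.0, 30.0), "bottom": (-30.0, 90.0), "back": (120.0, 240.0)}
active, turned_off = select_microphones(mics, first_range=(-20.0, 20.0))
print(active, turned_off)
```

The audio from the remaining active microphones can then be enhanced inside the first sound pickup range or attenuated outside it, as the preceding paragraphs describe.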
- the AP in the processor obtains the recorded video by using the obtained video picture. After detecting an operation instructing to stop shooting, a recorded video including the second audio and video images is obtained.
- the recorded video may contain multiple audio files, wherein each audio file contains a channel of audio.
- the electronic device may determine one or more reference first sound pickup ranges in the vicinity of the first sound pickup range.
- the electronic device obtains one channel of audio according to the first sound pickup range, and obtains at least one channel of audio according to the reference first sound pickup range, and the electronic device may also use panoramic audio as one channel of audio.
- the electronic device can obtain the multi-channel audio corresponding to the first face image or the first human mouth image based on the first sound pickup range.
- one channel of audio can be understood as an audio file.
- the user can choose to delete part of the audio, and save the audio that he considers the best, so as to improve the user experience and reduce the storage pressure of the memory.
- Embodiments of the present application further provide an electronic device, including one or more processors and one or more memories.
- the one or more memories are coupled to the one or more processors and are used to store computer program code, the computer program code comprising computer instructions that, when executed by the one or more processors, cause the electronic device to perform the above related method steps to implement the audio processing method in the above embodiment.
- An embodiment of the present application further provides a chip system, including a processor, where the processor is coupled with a memory, the memory is used to store a program or instructions, and when the program or instructions are executed by the processor, the chip system implements the method in any of the foregoing method embodiments.
- the number of processors in the chip system may be one or more.
- the processor can be implemented by hardware or by software.
- the processor may be a logic circuit, an integrated circuit, or the like.
- the processor may be a general-purpose processor implemented by reading software codes stored in memory.
- the memory may be integrated with the processor, or may be provided separately from the processor, which is not limited in this application.
- the memory can be a non-transitory memory, such as a read-only memory (ROM), which can be integrated with the processor on the same chip or provided on a different chip.
- the setting method of the processor is not particularly limited.
- the chip system may be a field programmable gate array (FPGA), an application specific integrated circuit (ASIC), or a system on chip (SoC); it can also be a central processing unit (CPU), a network processor (NP), a digital signal processing circuit (DSP), or a microcontroller unit (MCU); it can also be a programmable logic device (PLD) or another integrated chip.
- each step in the above method embodiments may be implemented by a hardware integrated logic circuit in a processor or an instruction in the form of software.
- the method steps disclosed in conjunction with the embodiments of the present application may be directly embodied as being executed by a hardware processor, or executed by a combination of hardware and software modules in the processor.
- Embodiments of the present application further provide a computer-readable storage medium, where computer instructions are stored in the computer-readable storage medium.
- when the computer instructions run on a terminal device, the terminal device executes the above related method steps to implement the audio processing method in the above embodiments.
- Embodiments of the present application further provide a computer program product, which, when the computer program product runs on a computer, causes the computer to execute the above-mentioned relevant steps, so as to realize the audio processing method in the above-mentioned embodiment.
- embodiments of the present application further provide an apparatus, which may specifically be a component or a module, and the apparatus may include a processor and a memory that are connected to each other, where the memory is used to store computer-executable instructions, and when the apparatus is running, the processor can execute the computer-executable instructions stored in the memory, so that the apparatus executes the audio processing method in the above method embodiments.
- the terminal device, computer-readable storage medium, computer program product, and chip provided in the embodiments of the present application are all used to execute the corresponding methods provided above. Therefore, for the beneficial effects they can achieve, reference may be made to the beneficial effects of the corresponding methods, which are not repeated here.
- the electronic device includes corresponding hardware and/or software modules for executing each function.
- in conjunction with the algorithm steps of each example described in the embodiments disclosed herein, the present application can be implemented in hardware or in a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends on the specific application and design constraints of the technical solution. Those skilled in the art may use different methods to implement the described functionality for each particular application in conjunction with the embodiments, but such implementations should not be considered beyond the scope of this application.
- the electronic device can be divided into functional modules according to the above method examples.
- each functional module can be divided corresponding to each function, or two or more functions can be integrated into one processing module.
- the above-mentioned integrated modules can be implemented in the form of hardware. It should be noted that, the division of modules in this embodiment is schematic, and is only a logical function division, and there may be other division manners in actual implementation.
- the disclosed method may be implemented in other manners.
- the terminal device embodiments described above are only illustrative.
- the division of the modules or units is only a logical function division.
- the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection between modules or units through some interfaces, and may be in electrical, mechanical, or other forms.
- the units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
- each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
- the above-mentioned integrated units may be implemented in the form of hardware, or may be implemented in the form of software functional units.
- the integrated unit if implemented in the form of a software functional unit and sold or used as an independent product, may be stored in a computer-readable storage medium.
- the technical solutions of the present application, in essence, or the parts that contribute to the prior art, or all or part of the technical solutions, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the methods described in the various embodiments of the present application.
- the aforementioned storage medium includes: flash memory, removable hard disk, read-only memory, random access memory, magnetic disk or optical disk and other media that can store program instructions.
Landscapes
- Engineering & Computer Science (AREA)
- Signal Processing (AREA)
- Multimedia (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Health & Medical Sciences (AREA)
- Otolaryngology (AREA)
- Human Computer Interaction (AREA)
- General Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- Studio Devices (AREA)
- Circuit For Audible Band Transducer (AREA)
Abstract
Description
| Coordinate relationship | Offset direction |
|---|---|
| X1<X2 | Left |
| X1>X2 | Right |
| X1=X2 | No horizontal offset |
| Y1<Y2 | Down |
| Y1>Y2 | Up |
| Y1=Y2 | No vertical offset |
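The coordinate table above maps directly onto a pair of comparisons. In this sketch it is assumed that (X1, Y1) is the center point of the target image and (X2, Y2) is the first reference point, and that Y increases upward; the table itself does not state these definitions, and the function name is hypothetical.

```python
from typing import Tuple


def offset_direction(x1: float, y1: float, x2: float, y2: float) -> Tuple[str, str]:
    """Return (horizontal, vertical) offset directions per the coordinate table."""
    if x1 < x2:
        horizontal = "left"
    elif x1 > x2:
        horizontal = "right"
    else:
        horizontal = "no horizontal offset"

    if y1 < y2:
        vertical = "down"
    elif y1 > y2:
        vertical = "up"
    else:
        vertical = "no vertical offset"

    return horizontal, vertical


print(offset_direction(3.0, 5.0, 5.0, 5.0))
```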
| Recommended value | Enhanced value 1 | Enhanced value 2 | Enhanced value 3 |
|---|---|---|---|
| N*X/Y | 1.1*N*X/Y | 0.95*N*X/Y | 1.05*N*X/Y |
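The recommended and enhanced values in the table above can be computed as follows. The sketch assumes that X/Y is the area ratio and that N is the sound pickup range of the first audio (for example, an angular width in degrees); the dictionary keys and the function name are hypothetical labels for the table columns.

```python
from typing import Dict

# Multipliers taken from the table; the key names are illustrative.
FACTORS: Dict[str, float] = {
    "recommended": 1.0,
    "enhanced_1": 1.1,
    "enhanced_2": 0.95,
    "enhanced_3": 1.05,
}


def candidate_pickup_ranges(n: float, x: float, y: float) -> Dict[str, float]:
    """Return the recommended value N*X/Y and its three scaled variants."""
    if y == 0:
        raise ValueError("Y (video picture area) must be non-zero")
    base = n * x / y
    return {name: factor * base for name, factor in FACTORS.items()}


# Illustrative values: N = 180, area ratio X/Y = 1/4.
candidates = candidate_pickup_ranges(n=180.0, x=1.0, y=4.0)
for name, value in candidates.items():
    print(name, round(value, 2))
```

Each candidate could yield one channel of audio, so that the recorded video carries several audio files from which the user keeps the preferred one.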
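| Recommended value | Enhanced value 1 | Enhanced value 2 | Enhanced value 3 |
|---|---|---|---|
| Algorithm 1 | Algorithm 2 | Algorithm 3 | Algorithm 4 |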
Claims (20)
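- An audio processing method, wherein the method is applied to an electronic device and comprises: detecting a first operation of opening a camera application; in response to the first operation, displaying a shooting preview interface; detecting a second operation of starting video recording; in response to the second operation, capturing a video picture and first audio, and displaying a shooting interface, the shooting interface comprising a preview interface of the video picture; recognizing a target image in the video picture, the target image being a first face image and/or a first human mouth image, wherein the first face image is a face image of a sound-producing object in the video image, and the first human mouth image is a mouth image of the sound-producing object in the video image; determining, according to the target image, a first sound pickup range corresponding to the sound-producing object; and obtaining, according to the first sound pickup range and the first audio, second audio corresponding to the video picture, wherein in the second audio the audio volume within the first sound pickup range is greater than the audio volume outside the first sound pickup range.
- The method according to claim 1, wherein determining, according to the target image, the first sound pickup range corresponding to the sound-producing object comprises: obtaining a first feature value according to the target image, wherein the first feature value comprises one or more of a front/rear attribute parameter, an area ratio, and position information; the front/rear attribute parameter indicates whether the video picture was shot by a front camera or by a rear camera; the area ratio indicates the ratio of the area of the target image to the area of the video picture; and the position information indicates the position of the target image in the video picture; and determining, according to the first feature value, the first sound pickup range corresponding to the sound-producing object.
- The method according to claim 2, wherein determining, according to the first feature value, the first sound pickup range corresponding to the sound-producing object comprises: when the video picture is a front video picture, determining that the first sound pickup range is the sound pickup range on the front camera side; and when the video picture is a rear video picture, determining that the first sound pickup range is the sound pickup range on the rear camera side.
- The method according to claim 2, wherein determining, according to the first feature value, the first sound pickup range corresponding to the sound-producing object comprises: determining the first sound pickup range according to the area ratio and the sound pickup range of the first audio.
- The method according to claim 2, wherein determining, according to the first feature value, the first sound pickup range corresponding to the sound-producing object comprises: determining, according to the position information, the position of the first sound pickup range within the sound pickup range of the first audio.
- The method according to claim 5, wherein the position information comprises a first offset of the center point of the target image relative to a first reference point, the first reference point being the center point of the video picture or the focus point; and determining, according to the position information, the position of the first sound pickup range within the sound pickup range of the first audio comprises: determining, according to the first offset, a second offset of the center point of the first sound pickup range relative to the center point of the sound pickup range of the first audio, the second offset being proportional to the first offset; and determining, according to the second offset, the position of the first sound pickup range within the sound pickup range of the first audio.
- The method according to claim 5 or 6, wherein the center point of the video picture is the center point of the viewfinder frame, or the center point of the video picture is the center point of the display screen.
- The method according to any one of claims 1 to 7, wherein obtaining, according to the first sound pickup range and the first audio, the second audio corresponding to the video picture comprises: enhancing the audio signals in the first audio that are within the first sound pickup range, and/or attenuating the audio signals in the first audio that are outside the first sound pickup range, to obtain the second audio.
- The method according to claim 8, wherein the electronic device comprises one or more microphones, and the one or more microphones are used to collect the first audio; and obtaining, according to the first sound pickup range and the first audio, the second audio corresponding to the video picture comprises: when the sound pickup range of a first microphone among the one or more microphones contains part or all of the first sound pickup range, performing at least one of the following operations to obtain the second audio: enhancing the audio signals within the first sound pickup range in the sound pickup range of the first microphone; attenuating the audio signals outside the first sound pickup range in the sound pickup range of the first microphone; and attenuating the audio signals of the microphones other than the first microphone among the one or more microphones.
- The method according to claim 8, wherein the electronic device comprises at least two microphones, and the at least two microphones are used to collect the first audio; and obtaining, according to the first sound pickup range and the first audio, the second audio corresponding to the video picture comprises: when the sound pickup range of a second microphone among the at least two microphones does not contain the first sound pickup range, turning off the second microphone, wherein the audio collected by the microphones other than the second microphone among the at least two microphones is the second audio.
- The method according to claim 10, wherein, when the second microphone is turned off, the method further comprises: enhancing the audio signals within the first sound pickup range in the sound pickup ranges of the microphones other than the second microphone among the at least two microphones, and/or attenuating the audio signals outside the first sound pickup range in the sound pickup ranges of the microphones other than the second microphone among the at least two microphones.
- The method according to any one of claims 2 to 11, wherein the number of first face images is one or more, and the number of first human mouths is one or more.
- The method according to any one of claims 1 to 12, wherein, after capturing the video picture and the first audio and displaying the shooting interface in response to the second operation, the method further comprises: detecting a third operation of stopping shooting; in response to the third operation, stopping recording and generating a recorded video, the recorded video comprising the video picture and the second audio; detecting a fourth operation of playing the recorded video; and in response to the fourth operation, displaying a video playback interface and playing the video picture and the second audio.
- The method according to claim 13, wherein the recorded video further comprises third audio, the third audio being audio determined according to a second sound pickup range, and the second sound pickup range being a sound pickup range that is determined according to the first sound pickup range and differs from the first sound pickup range; and the video playback interface comprises a first control and a second control, the first control corresponding to the second audio and the second control corresponding to the third audio.
- The method according to claim 14, wherein the method further comprises: in response to the fourth operation, playing the video picture and the second audio, the fourth operation comprising an operation on a playback control or an operation on the first control; detecting a fifth operation on the second control; and in response to the fifth operation, playing the video picture and the third audio.
- The method according to claim 14 or 15, wherein the method further comprises: in response to an operation of deleting the second audio or the third audio, deleting the second audio or the third audio.
- The method according to any one of claims 1 to 16, wherein, after displaying the shooting preview interface in response to the first operation, the method further comprises: detecting a sixth operation of starting a voice enhancement mode; and in response to the sixth operation, starting the voice enhancement mode.
- An electronic device, comprising a processor, a memory, a microphone, a camera, and a display screen, wherein the memory, the microphone, the camera, and the display screen are coupled to the processor, the memory is used to store computer program code, the computer program code comprises computer instructions, and when the processor reads the computer instructions from the memory, the electronic device is caused to perform the audio processing method according to any one of claims 1 to 17.
- A computer-readable storage medium storing instructions, wherein, when the instructions run on an electronic device, the electronic device is caused to perform the audio processing method according to any one of claims 1 to 17.
- A computer program product containing instructions, wherein, when the computer program product runs on an electronic device, the electronic device is caused to perform the audio processing method according to any one of claims 1 to 17.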
Priority Applications (3)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| EP21860008.8A EP4192004A4 (en) | 2020-08-26 | 2021-07-26 | AUDIO PROCESSING METHOD AND ELECTRONIC DEVICE |
| JP2023513516A JP7583914B2 (ja) | 2020-08-26 | 2021-07-26 | オーディオ処理方法および電子デバイス |
| US18/042,753 US12245006B2 (en) | 2020-08-26 | 2021-07-26 | Audio processing method and electronic device |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010868463.5 | 2020-08-26 | ||
| CN202010868463.5A CN113556501A (zh) | 2020-08-26 | 2020-08-26 | Audio processing method and electronic device |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2022042168A1 (zh) | 2022-03-03 |
Family
ID=78101621
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2021/108458 Ceased WO2022042168A1 (zh) | 2020-08-26 | 2021-07-26 | Audio processing method and electronic device |
Country Status (5)
| Country | Link |
|---|---|
| US (1) | US12245006B2 (zh) |
| EP (1) | EP4192004A4 (zh) |
| JP (1) | JP7583914B2 (zh) |
| CN (1) | CN113556501A (zh) |
| WO (1) | WO2022042168A1 (zh) |
Cited By (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2023231686A1 (zh) * | 2022-05-30 | 2023-12-07 | 荣耀终端有限公司 | 一种视频处理方法和终端 |
| EP4485910A4 (en) * | 2022-04-19 | 2025-10-08 | Huawei Tech Co Ltd | DIRECTIONAL SOUND CAPTURE METHOD AND DEVICE |
Families Citing this family (23)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US10009536B2 (en) | 2016-06-12 | 2018-06-26 | Apple Inc. | Applying a simulated optical effect based on data received from multiple camera sensors |
| DK180859B1 (en) | 2017-06-04 | 2022-05-23 | Apple Inc | USER INTERFACE CAMERA EFFECTS |
| US11112964B2 (en) | 2018-02-09 | 2021-09-07 | Apple Inc. | Media capture lock affordance for graphical user interface |
| US11722764B2 (en) | 2018-05-07 | 2023-08-08 | Apple Inc. | Creative camera |
| DK201870623A1 (en) | 2018-09-11 | 2020-04-15 | Apple Inc. | User interfaces for simulated depth effects |
| US11321857B2 (en) | 2018-09-28 | 2022-05-03 | Apple Inc. | Displaying and editing images with depth information |
| US11770601B2 (en) | 2019-05-06 | 2023-09-26 | Apple Inc. | User interfaces for capturing and managing visual media |
| US11054973B1 (en) | 2020-06-01 | 2021-07-06 | Apple Inc. | User interfaces for managing media |
| US11212449B1 (en) * | 2020-09-25 | 2021-12-28 | Apple Inc. | User interfaces for media capture and management |
| JP7651350B2 (ja) * | 2021-03-30 | 2025-03-26 | キヤノン株式会社 | 制御装置及びその制御方法及びプログラム及び記録媒体 |
| US11778339B2 (en) | 2021-04-30 | 2023-10-03 | Apple Inc. | User interfaces for altering visual media |
| US12112024B2 (en) | 2021-06-01 | 2024-10-08 | Apple Inc. | User interfaces for managing media styles |
| CN113990340B (zh) * | 2021-11-22 | 2024-12-31 | 北京声智科技有限公司 | 音频信号的处理方法、装置、终端及存储介质 |
| US12506953B2 (en) | 2021-12-03 | 2025-12-23 | Apple Inc. | Device, methods, and graphical user interfaces for capturing and displaying media |
| TWI831175B (zh) * | 2022-04-08 | 2024-02-01 | 驊訊電子企業股份有限公司 | 虛擬實境提供裝置與音頻處理方法 |
| CN114679647B (zh) * | 2022-05-30 | 2022-08-30 | 杭州艾力特数字科技有限公司 | 无线麦拾音距离的确定方法、装置、设备及可读存储介质 |
| CN116048448B (zh) * | 2022-07-26 | 2024-05-24 | 荣耀终端有限公司 | 一种音频播放方法及电子设备 |
| CN118741218A (zh) * | 2023-03-28 | 2024-10-01 | 华为技术有限公司 | 视频录制、播放的方法及电子设备 |
| US20240373121A1 (en) | 2023-05-05 | 2024-11-07 | Apple Inc. | User interfaces for controlling media capture settings |
| US12602154B2 (en) | 2024-01-18 | 2026-04-14 | Apple Inc. | User interfaces integrating hardware buttons |
| CN118301513A (zh) * | 2024-03-29 | 2024-07-05 | 联想(北京)有限公司 | 信息处理方法及电子设备 |
| CN118042329B (zh) * | 2024-04-11 | 2024-07-02 | 深圳波洛斯科技有限公司 | 基于会议场景的多麦克风阵列降噪方法及其系统 |
| CN120676113A (zh) * | 2025-08-12 | 2025-09-19 | 歌尔股份有限公司 | 音频录制方法、头戴设备及存储介质 |
Citations (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN101101752A (zh) * | 2007-07-19 | 2008-01-09 | 华中科技大学 | 基于视觉特征的单音节语言唇读识别系统 |
| US20150279364A1 (en) * | 2014-03-29 | 2015-10-01 | Ajay Krishnan | Mouth-Phoneme Model for Computerized Lip Reading |
| US9728203B2 (en) * | 2011-05-02 | 2017-08-08 | Microsoft Technology Licensing, Llc | Photo-realistic synthesis of image sequences with lip movements synchronized with speech |
| CN108711430A (zh) * | 2018-04-28 | 2018-10-26 | 广东美的制冷设备有限公司 | 语音识别方法、智能设备及存储介质 |
| CN109145853A (zh) * | 2018-08-31 | 2019-01-04 | 百度在线网络技术(北京)有限公司 | 用于识别噪音的方法和装置 |
| CN109413563A (zh) * | 2018-10-25 | 2019-03-01 | Oppo广东移动通信有限公司 | 视频的音效处理方法及相关产品 |
| CN110310668A (zh) * | 2019-05-21 | 2019-10-08 | 深圳壹账通智能科技有限公司 | 静音检测方法、系统、设备及计算机可读存储介质 |
Family Cites Families (16)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP5214394B2 (ja) | 2008-10-09 | 2013-06-19 | オリンパスイメージング株式会社 | カメラ |
| US9094645B2 (en) * | 2009-07-17 | 2015-07-28 | Lg Electronics Inc. | Method for processing sound source in terminal and terminal using the same |
| JP2011061461A (ja) | 2009-09-09 | 2011-03-24 | Sony Corp | 撮像装置、指向性制御方法及びそのプログラム |
| KR20110038313A (ko) * | 2009-10-08 | 2011-04-14 | 삼성전자주식회사 | 영상촬영장치 및 그 제어방법 |
| JP5523178B2 (ja) | 2010-04-14 | 2014-06-18 | オリンパスイメージング株式会社 | 記録装置 |
| CN102572356B (zh) * | 2012-01-16 | 2014-09-03 | 华为技术有限公司 | 记录会议的方法和会议系统 |
| JP2013179466A (ja) | 2012-02-28 | 2013-09-09 | Nikon Corp | 撮像装置 |
| US9258644B2 (en) * | 2012-07-27 | 2016-02-09 | Nokia Technologies Oy | Method and apparatus for microphone beamforming |
| US20150022636A1 (en) * | 2013-07-19 | 2015-01-22 | Nvidia Corporation | Method and system for voice capture using face detection in noisy environments |
| US9596437B2 (en) * | 2013-08-21 | 2017-03-14 | Microsoft Technology Licensing, Llc | Audio focusing via multiple microphones |
| CN104699445A (zh) * | 2013-12-06 | 2015-06-10 | 华为技术有限公司 | 一种音频信息处理方法及装置 |
| US9913027B2 (en) * | 2014-05-08 | 2018-03-06 | Intel Corporation | Audio signal beam forming |
| CN106486147A (zh) * | 2015-08-26 | 2017-03-08 | 华为终端(东莞)有限公司 | 指向性录音方法、装置及录音设备 |
| CN107402739A (zh) * | 2017-07-26 | 2017-11-28 | 北京小米移动软件有限公司 | 一种拾音方法及装置 |
| CN111050269B (zh) * | 2018-10-15 | 2021-11-19 | 华为技术有限公司 | 音频处理方法和电子设备 |
| CN110366065A (zh) * | 2019-07-24 | 2019-10-22 | 长沙世邦通信技术有限公司 | 定向跟随人脸位置拾音的方法、装置、系统及存储介质 |
2020
- 2020-08-26 CN CN202010868463.5A patent/CN113556501A/zh active Pending
2021
- 2021-07-26 JP JP2023513516A patent/JP7583914B2/ja active Active
- 2021-07-26 EP EP21860008.8A patent/EP4192004A4/en active Pending
- 2021-07-26 WO PCT/CN2021/108458 patent/WO2022042168A1/zh not_active Ceased
- 2021-07-26 US US18/042,753 patent/US12245006B2/en active Active
Non-Patent Citations (1)
| Title |
|---|
| See also references of EP4192004A4 |
Cited By (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| EP4485910A4 (en) * | 2022-04-19 | 2025-10-08 | Huawei Tech Co Ltd | DIRECTIONAL SOUND CAPTURE METHOD AND DEVICE |
| WO2023231686A1 (zh) * | 2022-05-30 | 2023-12-07 | 荣耀终端有限公司 | 一种视频处理方法和终端 |
| US12538012B2 (en) | 2022-05-30 | 2026-01-27 | Honor Device Co., Ltd. | Video processing method and terminal |
Also Published As
| Publication number | Publication date |
|---|---|
| JP2023540908A (ja) | 2023-09-27 |
| CN113556501A (zh) | 2021-10-26 |
| EP4192004A1 (en) | 2023-06-07 |
| US20230328429A1 (en) | 2023-10-12 |
| US12245006B2 (en) | 2025-03-04 |
| EP4192004A4 (en) | 2024-02-21 |
| JP7583914B2 (ja) | 2024-11-14 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| WO2022042168A1 (zh) | Audio processing method and electronic device | |
| CN113365012B (zh) | 一种音频处理方法及设备 | |
| EP4099688A1 (en) | Audio processing method and device | |
| US12192614B2 (en) | Photographing method in long-focus scenario and terminal | |
| EP3944063A1 (en) | Screen capture method and electronic device | |
| EP4044578B1 (en) | Audio processing method and electronic device | |
| EP3873084B1 (en) | Method for photographing long-exposure image and electronic device | |
| WO2021078001A1 (zh) | 一种图像增强方法及装置 | |
| CN110506416A (zh) | 一种终端切换摄像头的方法及终端 | |
| US12375866B2 (en) | Audio processing method and electronic device | |
| US12382163B2 (en) | Shooting method and related device | |
| WO2022062985A1 (zh) | 视频特效添加方法、装置及终端设备 | |
| WO2025030802A1 (zh) | 显示方法、电子设备和系统 | |
| CN118450254B (zh) | 一种终端设备的对焦优化方法、电子设备及存储介质 | |
| HK40101788A (zh) | 一种拍摄方法及相关设备 | |
| WO2026082099A1 (zh) | 图像处理方法、摄像头及电子设备 | |
| HK40101788B (zh) | 一种拍摄方法及相关设备 | |
| CN117221707A (zh) | 一种视频处理方法和终端 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 21860008; Country of ref document: EP; Kind code of ref document: A1 |
| | ENP | Entry into the national phase | Ref document number: 2023513516; Country of ref document: JP; Kind code of ref document: A |
| | WWE | Wipo information: entry into national phase | Ref document number: 202317014457; Country of ref document: IN |
| | ENP | Entry into the national phase | Ref document number: 2021860008; Country of ref document: EP; Effective date: 20230302 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | WWG | Wipo information: grant in national office | Ref document number: 18042753; Country of ref document: US |
| | WWG | Wipo information: grant in national office | Ref document number: 202317014457; Country of ref document: IN |



