EP4677546A1 - Image-based reconstruction of 3D landmarks for use in the generation of personalized head-related transfer functions

Info

Publication number: EP4677546A1
Authority: EP; European Patent Office
Prior art keywords: images; ear; distance; image; head
Prior art date: 2023-03-10
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.): Pending

Application number

EP24717865.0A

Inventor

Andrea FANELLI

Hailong SHI

Xuemei Yu

Deepak Chandran

Gabriel Antonio MARTINEZ MONTES

Yifei Liu

Hao Luo

Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)

Dolby Laboratories Licensing Corp

Original Assignee

Dolby Laboratories Licensing Corp

Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)

2023-03-10

Filing date

2024-03-07

Publication date

2026-01-14

2024-03-07 Application filed by Dolby Laboratories Licensing Corp

2026-01-14 Publication of EP4677546A1

Classifications

- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/50—Depth or shape recovery
- G06T7/55—Depth or shape recovery from multiple images
- G06T7/593—Depth or shape recovery from multiple images from stereo images
- G06T7/596—Depth or shape recovery from multiple images from stereo images from three or more stereo images
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/70—Determining position or orientation of objects or cameras
- G06T7/73—Determining position or orientation of objects or cameras using feature-based methods
- G06T7/75—Determining position or orientation of objects or cameras using feature-based methods involving models
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10016—Video; Image sequence
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/30—Subject of image; Context of image processing
- G06T2207/30196—Human being; Person
- G06T2207/30201—Face

Definitions

Head-Related Transfer Functions (HRTFs)
These functions typically describe linear filtering processes that reflect the acoustic effect of the ears, head, and torso on incoming sound waves.
Rendering audio using HRTFs is referred to as “binaural rendering”, and the resulting audio is referred to as “binaural audio”.
HRTFs may be defined in many ways, including as time-domain impulse responses or as frequency-domain responses. HRTFs are typically grouped in pairs, to provide a response for each ear. An HRTF filter pair may be used to provide a listener with an experience that mimics the sound (at each ear) that would occur if the audio signal were presented from a particular direction of arrival. Different HRTF filter pairs will produce the illusion of differing sound-source directions, timbre, and externalization.
Personalized HRTFs (pHRTFs)
Personalized HRTFs can be obtained through experimental measurement procedures, or modelled using personalized information pertaining to the user. Dimensions and orientation of ear and head features are examples of anthropometric information that may be used to generate pHRTFs. Document US2021/0211825 discloses an example of such techniques.
Anthropometric features (e.g., landmarks and distances between landmarks)
Existing methods for obtaining anthropometric features for pHRTF generation are complicated, often require specialized hardware, and are challenging to implement on devices with limited sensing and/or processing capabilities, such as smartphones.
A first aspect of the present invention relates to a method for 3D reconstruction of anthropometric features from images of a subject, comprising obtaining a set of images of the subject from a feed of images captured with an image capturing device from a plurality of perspectives and, for each image in the set, detecting 2D locations of a plurality of anthropometric landmarks and estimating a head pose based on the 2D locations.
The method further includes generating, based on the detected 2D locations and associated head poses for at least some of the images in the set of images, relative 3D coordinates corresponding to at least some of the anthropometric features captured in the set of images, and generating scaled 3D coordinates from the relative 3D coordinates based on properties associated with the image capturing device and an estimated distance between the subject and the image capturing device.
This approach provides an efficient way to obtain scaled 3D anthropometric features using a series of images obtained by simple means, e.g., using a mobile camera device such as a smartphone.
The anthropometric features are obtained by the detection of ear landmarks and/or face landmarks.
The set of images may be selected based on a desired uniform angular sampling frequency, which means that the images in the set have been acquired from a set of viewing angles with respect to the subject (e.g., the subject's face) and that these viewing angles are uniformly distributed over relevant viewing angle ranges.
Relevant yaw angle ranges may be 20-55 degrees on either side.
One way to achieve a uniform angular sampling frequency is to monitor the angular velocity of the subject (e.g., relative to the image capturing device) and adjust the sampling from the feed of images.
A motion metric representing the angular velocity may be based on data from sensors worn by the subject (e.g., IMU, accelerometers, gyroscopes), but may alternatively be provided by software in the image capturing device.
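As a rough illustration of velocity-driven sampling, the sketch below keeps a frame each time the rotation accumulated since the last kept frame reaches a target step. The function name, the `step_deg` parameter, and the use of a per-frame yaw rate in degrees per second are assumptions made for this example; the source does not specify how the sampling is adjusted.

```python
def sample_by_rotation(timestamps, yaw_rates_deg_s, step_deg=5.0):
    """Return indices of frames spaced roughly step_deg apart in rotation.

    timestamps: frame times in seconds, in increasing order.
    yaw_rates_deg_s: angular speed at each frame in deg/s (hypothetical
        input, e.g. from a gyroscope or from device software).
    step_deg: desired angular spacing between kept frames (assumed value).
    """
    if not timestamps:
        return []
    selected = [0]        # always keep the first frame
    accumulated = 0.0     # rotation since the last kept frame, in degrees
    for i in range(1, len(timestamps)):
        dt = timestamps[i] - timestamps[i - 1]
        accumulated += abs(yaw_rates_deg_s[i]) * dt
        if accumulated >= step_deg:
            selected.append(i)
            accumulated = 0.0
    return selected
```

When the head turns slowly, several frames are skipped between kept frames; when it turns quickly, frames are kept more often, which approximates uniform spacing in angle rather than in time.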
Another way to achieve a uniform angular sampling frequency is to simply use the head pose obtained for each image. In that case, a head pose may be obtained for a larger set of images, and a smaller set of images would then be selected for further processing.
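A minimal sketch of pose-based selection follows: given a yaw estimate for every frame of the larger set, it picks the frame closest to each of a set of evenly spaced target angles. The function name and all parameters are hypothetical, and nearest-angle matching is only one possible selection rule.

```python
def select_uniform_yaw(yaws_deg, yaw_min, yaw_max, num_targets):
    """Pick frame indices whose yaw is closest to evenly spaced targets.

    yaws_deg: estimated yaw per frame in degrees (one entry per frame).
    yaw_min, yaw_max: limits of the yaw range to cover (assumed inputs).
    num_targets: number of evenly spaced target angles (at least 2).
    """
    if num_targets < 2:
        raise ValueError("num_targets must be at least 2")
    if not yaws_deg:
        return []
    step = (yaw_max - yaw_min) / (num_targets - 1)
    chosen = []
    for k in range(num_targets):
        target = yaw_min + k * step
        # Frame whose yaw is nearest to this target angle.
        best = min(range(len(yaws_deg)), key=lambda i: abs(yaws_deg[i] - target))
        if best not in chosen:   # avoid selecting the same frame twice
            chosen.append(best)
    return chosen
```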
A subset of anthropometric landmarks is detected only for a subset of images for which the head pose is within a pre-determined head pose range (i.e., for viewing angles with respect to the head within a certain range).
Ear landmarks may be detected only for a subset of images for which the head pose is within a predetermined azimuth (yaw) range. For azimuth (yaw) angles substantially in front of the subject, the ears are not clearly visible, and detection of landmarks may be difficult or unreliable.
Head pose estimation involves estimating a rotation transform (yaw, pitch, roll) that transforms the 2D anthropometric landmark locations into a generic 3D face mesh centered in the image.
The matching may be done in the least-squares sense.
Head pose estimation may be achieved by making a request (call) to software in the image capturing device and obtaining a response including head pose data.
A response may also include 2D locations for at least a subset of the anthropometric landmarks.
The estimated distance between the subject and the image capturing device may be detected using any appropriate distance detection technology, including structured light imaging or time-of-flight imaging, or any other hardware-based distance/depth sensor. Alternatively, it is typically possible to obtain a satisfactory distance estimate using statistical information about anthropometric features of the subject, such as iris size or inter-pupillary distance. Processing to obtain the estimated distance may be performed on the set of images, or on a different set of images obtained in a separate process. For example, it may be beneficial to obtain a subset of images of the subject looking straight ahead, and estimate the distance based on this subset of images.
The scaled 3D coordinates may be used to obtain personalized HRTFs.
A default set of head-related transfer functions (HRTFs) is modified based on the scaled 3D coordinates.
The set of scaled 3D coordinates is applied to a generative model to generate data corresponding to one or more pHRTFs adapted to the subject.
A further aspect of the present invention relates to a method for determining a head radius, r, of a subject, comprising acquiring at least one image of the subject using an image capturing device, estimating a distance from the image capturing device to a face plane of the subject, identifying 2D coordinates of the subject’s ear tragi in the at least one image, projecting the 2D coordinates onto an image plane of the image capturing device, and estimating the head radius based on a distance between the projections of the ear tragi in the image plane and an estimated ratio between the head radius, r, and a distance between a face plane, to which the estimated distance relates, and an ear plane in which the ear tragi are located.
Figures 1a-1c illustrate a video capturing process for acquiring a set of images using a handheld device according to an embodiment of the present invention.
Figure 2 is a block diagram of a system according to an embodiment of the present invention.
Figure 3 is a flow chart of a method according to an embodiment of the present invention.
Figure 4 shows a set of ear landmarks on a head of a subject according to some embodiments of the present invention.
Figure 5 shows multi-angulation of 3D coordinates according to some embodiments of the present invention.
Figure 6 is a schematic illustration of imaging of face and ears of a subject.
Figure 7 is a schematic block diagram of an example device or architecture that may be used to implement embodiments of the invention.
The present disclosure describes techniques enabling reliable and accurate extraction of 3D locations and dimensions of anthropometric landmarks and features (e.g., relating to the ear and head) from image data (e.g., 2-dimensional or 2D images) captured by a wide variety of camera systems (e.g., a mobile phone, tablet, gaming device, laptop, etc. including a camera) using a variety of capture modalities. While these techniques are described with respect to anthropometric landmarks and features most relevant to pHRTF generation, it is understood that such techniques are generally useful for efficiently obtaining 3D dimensional and/or positional data corresponding to objects captured in 2D images.
The techniques described herein include strategies to infer 3D coordinates of anthropometric landmarks (e.g., head, ear, or torso landmarks) using a set of images showing a subject’s head and/or torso (e.g., a sequence of images or frames, which may or may not be obtained from video data).
The 3D coordinates are subsequently used as input information to generate personalized HRTFs (pHRTFs), which can be used to customize audio playback for a specific user.
Camera pose - may refer to camera position in the world coordinate system, expressed as (yaw, pitch, roll), for a frame/image (e.g., from a video).
Head pose - may refer to head orientation of the user’s head related to the camera, expressed as (yaw, pitch, roll).
Image Plane - may refer to the plane where the image is formed.
A 3D reconstruction process is achieved by a technique including the following steps:
Upper body anthropometric landmark detection includes, for example, head landmark detection, torso landmark detection, facial landmark detection, ear landmark detection, etc.
An additional step of detecting the presence and location of a face or an ear prior to 2D anthropometric landmark detection is performed.
Ear landmark detection is limited to images/frames where an ear has been detected (e.g., each image/frame of the set of images/frames includes a detected ear).
Facial landmark detection is limited to images/frames where a face has been detected (e.g., each image/frame in the set of images/frames includes a detected face).
Detection of ear landmark coordinates is limited to images/frames corresponding to head pose values that are within a fixed range (e.g., the [-10°, -45°] azimuth range, or the [10°, 45°] azimuth range).
The set of images/frames of the video to which landmark detection is applied is selected such that it uniformly samples the azimuth space. In other words, the angular separation between images in the set is roughly equal, so that the images in the set are evenly distributed around the face.
Ear landmark detection includes a determination of data corresponding to ear landmarks that describe the anatomic characteristics of the ear pinna, helix, anti-helix, tragi, fossa, and concha.
Estimating the camera pose for each image/frame of the set of images/frames of the video by leveraging the head pose of the user’s head relative to the camera in each respective image/frame (e.g., 3D reconstruction assuming either that the camera moves and the head remains still, or that the head rotates and the camera is still).
Head pose information is obtained via an API call to external software.
An instantaneous face mesh is a face mesh computed for each frame in a feed, e.g., a video feed.
The instantaneous face mesh information is used to detect the face or ear location to improve accuracy of 2D landmark detection.
3D face landmark coordinates (x,y,z).
A 3D point is identified as the point that minimizes the error in intersecting the projection lines.
Steps 1-3 are repeated.
Additional video data is captured, and steps 1-3 are repeated for a set of images from the additional video data.
a. The camera distance is estimated using a hardware depth sensor (e.g., a time-of-flight sensor, etc.). In some embodiments, the camera distance is estimated without a hardware depth sensor (e.g., using a depth-from-motion algorithm, multiple non-depth cameras, infrared camera/dot projector, etc.).
b. In some embodiments, the camera distance is estimated by measuring the user’s iris diameter, assuming that the iris size is constant over the human population, and scaling the world relative distances using the iris dimension as an absolute scaling factor.
c. In some embodiments, the camera distance is estimated by combining different strategies for distance scaling (e.g., iris diameter, pupil distance, tragi length, reference object, etc.). In some embodiments, results from multiple distance estimation strategies are combined using Bayesian inference.
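The source states only that results from multiple distance estimation strategies may be combined using Bayesian inference. One simple instance, shown here purely as an assumption, treats each strategy's output as an independent Gaussian estimate of the same distance under a flat prior, which reduces to inverse-variance weighting. The function name and the (mean, standard deviation) input format are hypothetical.

```python
import math


def fuse_gaussian_estimates(estimates):
    """Fuse independent Gaussian distance estimates.

    estimates: list of (mean_mm, std_mm) pairs, one per strategy
        (e.g. iris diameter, pupil distance). Format is an assumption.
    Returns (fused_mean_mm, fused_std_mm).
    """
    if not estimates:
        raise ValueError("at least one estimate is required")
    precision_sum = 0.0   # sum of 1 / variance
    weighted_sum = 0.0    # sum of mean / variance
    for mean, std in estimates:
        if std <= 0:
            raise ValueError("standard deviations must be positive")
        weight = 1.0 / (std * std)
        precision_sum += weight
        weighted_sum += weight * mean
    fused_mean = weighted_sum / precision_sum
    fused_std = math.sqrt(1.0 / precision_sum)
    return fused_mean, fused_std
```

Estimates with smaller uncertainty pull the fused value toward themselves, and the fused uncertainty is never larger than the smallest input uncertainty.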
Estimating the head radius for each image/frame of the set of images/frames based on the projection of 2D face mesh information onto the image plane and the estimated distance of the head from the camera (e.g., camera distance).
The head radius is estimated based on the distance between the ear plane (e.g., a plane that includes the two tragi, coronal to the head) and a face plane (e.g., a plane parallel to the ear plane, minimizing the distance between the face plane and the face mesh landmarks).
3D anthropometric landmarks e.g., distances, angles, areas
A first part of the process relates to acquiring the set of images of the subject (user) 1 including one or several anatomical features, such as ear(s) or eye(s), to enable determination of anthropometric features.
The images may in principle be acquired as separate images, but may otherwise be selected from a video feed.
The image capture device may be a mobile capture device 2 such as a phone or tablet, or a stationary device such as a gaming device, workstation, etc.
Figures 1a-1c illustrate a video capturing process acquiring a feed of images of a subject (user) 1 using a hand-held image capturing device 2. The process would be similar if a stationary device were used.
The acquired images should include at least the head 3 of the subject 1.
In figure 1a, the video capture is initialized with the subject 1 holding the camera 2 in front of his/her face and looking straight into the camera. Then, in figure 1b, with the video acquisition in progress, the subject turns his/her head to the right (or left) and then back to the starting position. Finally, in figure 1c, the subject turns his/her head to the left (right) and then back to the starting position.
The feed will include images over a wide azimuth range (e.g., 180 degrees, 110 degrees, 90 degrees, etc.). Further, by starting in a straight-ahead pose, the process can start by identifying anthropometric features in a front view, where many relevant features are visible. Further, by stopping at the starting position between the left and right turns, a “reset” of any offsets caused by the rotational motion can be ensured.
Figure 2 illustrates a block diagram of a system 10 according to some embodiments.
The system 10 is configured to process a set of input image frames 11 of a person’s upper body to extract 3D head, face, and ear landmark coordinates.
The set of images 11 is provided to three parallel modules: a 2D anthropometric landmark detection module 12, a head pose detection module 13, and a face distance estimation module 14. It is noted that although the modules 12, 13, 14 are illustrated in parallel, the processing of these modules may at least in part be sequential.
The head pose may be acquired by a request to external software (e.g., available in the image capturing device), and such a request may also provide a face mesh useful for the landmark detection in module 12.
The results from the modules 12 and 13 are provided to a multi-angulation module 15 outputting relative 3D coordinates of anthropometric features. These coordinates are then provided to a scaling module 16, which generates scaled 3D coordinates using an estimated distance D from module 14 and parameters P relating to the image capturing device (such as focal length).
In step S11 (obtain images), a set of images of the subject is obtained from a feed of images captured with an image capturing device from a plurality of perspectives.
The processing then continues to step S12, which, together with steps S13 and S14, forms a loop that is executed for each image in the set.
In step S13 (detect 2D locations), 2D locations of a plurality of anthropometric landmarks are detected.
The landmarks may be, e.g., face landmarks or ear landmarks.
In step S14 (estimate head pose), a head pose is estimated based on the 2D locations.
A further selection of a subset of images may take place. Such selection may serve to ensure a subset of images with a uniform angular sampling frequency, i.e., images acquired with viewing angles with respect to the head of the subject evenly distributed in angular space.
The anthropometric landmarks detected in a specific image may depend on the estimated head pose. For example, for a head pose which is substantially facing straight ahead (into the camera), it may be impossible (or unreliable) to detect ear landmarks.
In step S15 (generate relative 3D coordinates), relative 3D coordinates corresponding to a set of the anthropometric landmarks are generated based on the detected 2D locations of corresponding landmarks and associated head poses for the set of images.
The set of anthropometric landmarks for which 3D coordinates are determined may include all, or only some, of the 2D locations detected in each image. For example, as mentioned above, depending on the estimated head pose, some landmarks may be less relevant.
In step S16 (generate scaled 3D coordinates), scaled 3D coordinates are generated from the relative 3D coordinates based on properties associated with the image capturing device and an estimated distance between the subject and the image capturing device. Relevant properties may be focal length, pixel density, field of view, etc.
The estimated distance may be obtained by appropriate distance/depth detection hardware, or by relating some of the detected anthropometric landmarks to known statistical data for certain anatomical features, such as pupil diameter or inter-pupillary distance.
2D anthropometric landmark detection is performed on image frames (e.g., individual images or images constituting a frame). Landmark detection can be achieved using different strategies.
2D landmark detection is performed using a deep neural network (DNN).
A DNN includes a MobileNet V1 or V2 backbone, with a fully connected network on top and a regression activation function.
The DNN is trained using the mean Euclidean error between estimated (x,y) coordinates and annotated coordinates as the cost function.
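The cost function named above can be written out directly. The sketch below computes the mean Euclidean error over a set of landmarks in plain Python; the function name and the list-of-tuples input format are assumptions, and a real training setup would express the same quantity in the tensor operations of whichever framework is used.

```python
import math


def mean_euclidean_error(predicted, annotated):
    """Mean Euclidean distance between predicted and annotated landmarks.

    predicted, annotated: equal-length lists of (x, y) pixel coordinates.
    """
    if len(predicted) != len(annotated) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for (px, py), (ax, ay) in zip(predicted, annotated):
        total += math.hypot(px - ax, py - ay)   # per-landmark distance
    return total / len(predicted)
```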
A large dataset of annotated images characterized by human head, torso, and ear landmarks is used for training.
Strategies such as image augmentation, regularization, dropout, and pooling can be used to improve the performance on the test data.
Traditional computer vision strategies for landmark detection can be used in conjunction with or in place of the exemplary strategies described above.
A region relating to a specific relevant feature (e.g., an ear or face) can be estimated with instantaneous faceMesh and head pose information obtained from external sources (e.g., via API call or otherwise from external software), e.g., software available in the image capturing device.
This step is typically performed before ear and face landmark detection, and may be used to reduce complexity and/or improve accuracy of subsequent processes.
Figure 4 illustrates exemplary results of 2D ear landmark detection in accordance with some embodiments.
A plurality of ear landmarks 18 are indicated in an image of a head 3.
Similar anthropometric landmark annotation can be applied to other upper body parts, such as head, face, or torso.
A head pose of each image in the set is determined.
The head pose can be considered a coordinate transform between the camera coordinate system and the head coordinate system.
The transform may be a rotational transform in two or three degrees of freedom, e.g., yaw, pitch, and roll.
Camera pose information (the camera yaw, pitch, and roll in the world reference system) is obtained (e.g., determined, estimated, accessed, received, etc.) by measuring the head pose information (the yaw, pitch, and roll position of the head in relation to the image frame).
Camera pose information is obtained by calling a head-pose API, e.g., available in the image capturing device, which returns information about head pose, and this information is used in the algorithms described herein.
Head pose information is measured directly.
Known strategies and techniques for facial landmark detection, such as open-source libraries or other DNNs trained for this purpose, are leveraged.
Camera pose estimation is performed on a subset of the images/frames of the video. For example, in accordance with some embodiments, for frames where the estimated camera pose is determined to be missing, unreliable, or wrong, the camera pose values are corrected or re-estimated using machine learning or signal processing algorithms, such as regression or Kalman filtering.
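To make the Kalman filtering option concrete, here is a minimal scalar sketch applied to a yaw sequence in which missing or rejected measurements are marked as `None`. The random-walk motion model, the variance values, and the function name are assumptions chosen for brevity; the source does not specify a state model.

```python
def kalman_filter_yaw(measurements, process_var=4.0, meas_var=1.0):
    """Filter a yaw sequence (degrees); None marks a missing measurement.

    Uses a scalar random-walk model (an assumption): the yaw is expected to
    stay where it was, with uncertainty growing by process_var per frame.
    Returns one estimate per frame (None until the first valid measurement).
    """
    estimates = []
    x = None   # current yaw estimate
    p = None   # variance of the current estimate
    for z in measurements:
        if x is None:
            if z is not None:
                x, p = z, meas_var   # initialise from first valid measurement
            estimates.append(x)
            continue
        p = p + process_var          # predict: uncertainty grows
        if z is not None:            # update only when a measurement exists
            gain = p / (p + meas_var)
            x = x + gain * (z - x)
            p = (1.0 - gain) * p
        estimates.append(x)
    return estimates
```

For frames without a usable measurement the filter carries the previous estimate forward, so every frame after initialisation receives a pose value.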
In module 15, relative 3D coordinates for a set of anthropometric features are determined. Specifically, using the head pose (rotational transform) expressed as yaw, pitch, and roll, and the 2D (x,y) coordinates of the landmarks in the image plane, the anthropometric landmarks can be projected into the world reference system, and the 3D point in space where the projections approximately intersect can be identified, as illustrated in figure 5.
Figure 5 illustrates using head pose information for a set of images acquired from different viewing angles (k-1, k, k+1) to project the 2D (x,y) coordinates P_{j,k-1}, P_{j,k}, P_{j,k+1} of a landmark in the images onto the 3D world reference system.
The approximate intersection of the projections identifies the 3D (x, y, z) coordinates P_j of the landmark in the world reference system.
Several cameras are illustrated to represent images of the subject acquired from different viewing angles.
Strategies include identifying the point p that minimizes the average squared distance between p and the projection lines, in the least-squares sense.
Strategies include identifying the point p that minimizes the mean Euclidean error between the original 2D landmarks (x,y) and the (x,y) coordinates obtained by re-projecting p back into the image frames, averaged over the image perspectives.
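The first strategy (the point closest to all projection lines in the least-squares sense) has a closed-form solution. The sketch below assumes each projection line is given by a hypothetical camera origin and a direction vector in the world reference system; how those are derived from the head pose and the 2D landmark is outside the scope of this example. It sums the projectors (I - d dᵀ) over all lines and solves the resulting 3x3 system.

```python
import math


def _solve3(a, b):
    """Solve a 3x3 linear system by Gauss-Jordan elimination with pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("lines are (nearly) parallel; point undetermined")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][3] / m[i][i] for i in range(3)]


def closest_point_to_lines(origins, directions):
    """Point minimising the sum of squared distances to a set of 3D lines.

    origins: list of [x, y, z] points, one on each line (assumed inputs).
    directions: list of [x, y, z] direction vectors (need not be unit length).
    """
    a = [[0.0] * 3 for _ in range(3)]
    b = [0.0] * 3
    for o, d in zip(origins, directions):
        norm = math.sqrt(sum(c * c for c in d))
        d = [c / norm for c in d]
        for i in range(3):
            for j in range(3):
                # Entry (i, j) of the projector onto the plane normal to d.
                proj = (1.0 if i == j else 0.0) - d[i] * d[j]
                a[i][j] += proj
                b[i] += proj * o[j]
    return _solve3(a, b)
```

At least two non-parallel lines are needed; with noisy inputs the lines will not meet exactly and the returned point is the least-squares compromise.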
Angular velocity could be used to infer the accuracy of the multi-angulation process. Higher values of angular velocity are associated with fast head rotation, which might be associated with blurred images (e.g., images not suitable for performing one or more processing steps). Lower values of angular velocity are instead associated with less blur and higher image definition. As such, in some embodiments, angular velocity is used to efficiently select a subset of images/frames for 2D/3D landmark recognition that are more likely to provide suitable data for feature extraction and ultimately pHRTF generation. Such selection could be made when obtaining the set of images in step S11, or could take place before the processing in step S15 in figure 3, to select a subset of the set of images.
Multi-angulation error may be used as a surrogate for multi-angulation accuracy.
Different metrics of multi-angulation error are used for this purpose. For example, the multi-angulation error can be measured as the mean distance of p from the projection lines, or as the mean Euclidean distance between the original (x,y) landmarks in the image planes and the corresponding (x,y) coordinates obtained by re-projecting p into the image planes, over all the image perspectives.
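The first error metric, the mean distance of p from the projection lines, can be sketched as follows. As an assumption for this example, each line is described by a point on the line and a direction vector; the function names are hypothetical.

```python
import math


def point_line_distance(p, origin, direction):
    """Perpendicular distance from point p to the line origin + t * direction."""
    norm = math.sqrt(sum(c * c for c in direction))
    d = [c / norm for c in direction]
    v = [pc - oc for pc, oc in zip(p, origin)]
    along = sum(vc * dc for vc, dc in zip(v, d))           # component along the line
    perp = [vc - along * dc for vc, dc in zip(v, d)]       # component across the line
    return math.sqrt(sum(c * c for c in perp))


def mean_line_distance(p, origins, directions):
    """Mean distance of p from a set of lines (one possible error metric)."""
    if not origins or len(origins) != len(directions):
        raise ValueError("origins and directions must be non-empty and equal length")
    total = sum(point_line_distance(p, o, d) for o, d in zip(origins, directions))
    return total / len(origins)
```

A larger value indicates that the projection lines disagree more about where the landmark is, which is the sense in which the error stands in for accuracy.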
Multi-angulation accuracy is used to steer subsequent steps in pHRTF generation processes. For example, in accordance with determining that multi-angulation accuracy is low (e.g., below a threshold) for one ear, additional data may be obtained (e.g., a user may be prompted to recapture images corresponding to that side of their head) and the overall landmark detection and extraction process is repeated with the newly obtained data.
data corresponding to the contra-lateral ear i.e., the other ear
a corresponding pHRTF i.e., by assuming symmetry in user’s ears size and shape
Depth information of images/frames is obtained (e.g., via API call) and used along with known camera and sensor intrinsic parameters P (e.g., focal length, AOV, FOV, etc.) to determine absolute scaling.
The module 14 obtains depth information which describes the distance of each object in the image from the camera image plane, computed on a pixel-by-pixel basis.
The module 14 determines the iris size in an image and infers the distance D of the head from the camera image plane using that measure.
Iris size is determined based on one or more images/frames corresponding to the subject facing directly at the camera (e.g., based on a detected head pose angle). Using the iris enables measuring/estimating the distance of a face from the camera with less than 10% error, without requiring any specialized hardware. This relies on the property that the horizontal iris diameter of the human eye remains roughly constant at 11.7 ± 0.5 mm across a wide population.
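Under a simple pinhole camera model (an assumption; the source does not give the formula), an object of physical size s at distance D appears with image size s·f/D, so the distance follows from the measured iris diameter in pixels and the focal length in pixels. The function and parameter names below are hypothetical.

```python
IRIS_DIAMETER_MM = 11.7   # population value quoted in the text (± 0.5 mm)


def distance_from_iris(iris_diameter_px, focal_length_px,
                       iris_diameter_mm=IRIS_DIAMETER_MM):
    """Estimate camera-to-face distance in mm from the imaged iris size.

    iris_diameter_px: horizontal iris diameter measured in the image (pixels).
    focal_length_px: camera focal length expressed in pixels.
    Assumes a pinhole camera and an iris roughly parallel to the image plane.
    """
    if iris_diameter_px <= 0 or focal_length_px <= 0:
        raise ValueError("iris diameter and focal length must be positive")
    return focal_length_px * iris_diameter_mm / iris_diameter_px
```

Because the estimate is proportional to the assumed iris diameter, the ± 0.5 mm population spread alone contributes roughly ± 4% to the distance; measurement error in the pixel diameter adds to that.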
The head radius may be estimated using a distance between the edges of the ears, also known as ear tragi.
A challenge is that the ear tragi are located in an ear plane, which is slightly further away from the camera than the face plane (to which a distance has been estimated).
Figure 6 illustrates a technique for estimating head radius in accordance with some embodiments.
The head bitragion is the distance between the two ear tragi, the small pointed eminences of the external ear, situated in front of the concha and projecting backward over the meatus.
The head bitragion is assumed to be twice the length of the head radius, r.
The distance between a face plane and an ear plane (e.g., respective planes as discussed in the overview section above) is correlated to the head radius according to a pre-determined correlation coefficient a.
The correlation coefficient a is approximately 0.6.
The distance between the camera and the ear plane is D + ar.
The distance x between the ear tragi is determined.
The focal length f is one of the camera intrinsic parameters P.
The head radius is estimated from these parameters along with some simple geometric arguments.
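One way to spell out the geometric argument, offered as a sketch under a pinhole model rather than as the source's stated formula: if x is the distance between the projected tragi in the image plane (in the same units as f), a bitragion width 2r at depth D + ar projects to x = 2rf / (D + ar). Solving for r gives r = xD / (2f - ax). The function name and argument order are hypothetical.

```python
def head_radius(x_px, distance_mm, focal_length_px, a=0.6):
    """Estimate head radius r (mm) from the projected tragus-to-tragus distance.

    x_px: distance between the projected ear tragi in the image (pixels).
    distance_mm: estimated camera-to-face-plane distance D.
    focal_length_px: focal length f in pixels.
    a: ratio between the face-plane-to-ear-plane distance and r.
    Derived from x = 2 * r * f / (D + a * r), assuming a pinhole camera and
    an ear plane parallel to the image plane.
    """
    denominator = 2.0 * focal_length_px - a * x_px
    if denominator <= 0:
        raise ValueError("inputs are inconsistent with the assumed geometry")
    return x_px * distance_mm / denominator
```

Setting a to zero recovers the naive estimate r = xD / (2f), which ignores the extra depth of the ear plane and therefore underestimates the radius.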
Derived features (distances, angles, and areas) describing anthropometric features that affect pHRTFs are computed.
The derived features are used to generate pHRTFs using a generative model (e.g., a model trained to generate pHRTF data from data describing anthropometric features).
FIG. 7 shows a schematic block diagram of an example electronic device or architecture 200 (e.g., an apparatus 200) suitable for implementing example embodiments of the present disclosure.
The architecture 200 may form part of a mobile device such as a smartphone 2, but may also be a stand-alone piece of equipment.
Architecture 200 may embody, but is not limited to, the system as described in relation to FIG. 2.
The architecture 200 includes central processing unit (CPU) 201, which is capable of performing various processes in accordance with a program stored in, for example, read only memory (ROM) 202 or a program loaded from, for example, storage unit 208 to random access memory (RAM) 203.
The CPU 201 may be, for example, an electronic processor 201, which may include one or more processor cores, and in some examples the processor 201 may be multiple processors.
In RAM 203, the data required when CPU 201 performs the various processes is also stored, as required.
CPU 201, ROM 202 and RAM 203 are connected to one another via bus 204.
Input/output (I/O) interface 205 is also connected to bus 204.
The following components are connected to I/O interface 205: input unit 206, which may include a keyboard, a mouse, or the like; output unit 207, which may include a display such as a liquid crystal display (LCD) and one or more speakers; storage unit 208, including a hard disk or another suitable storage device; and communication unit 209, which may include a network interface card such as a network card (e.g., wired or wireless).
The input unit 206 also includes a camera enabling capture of a feed of images.
Output unit 207 includes systems with various numbers of speakers.
Output unit 207 can render audio signals in various formats (e.g., mono, stereo, immersive, binaural, and other suitable formats).
Communication unit 209 is configured to communicate with other devices (e.g., via a network).
Drive 210 is also connected to I/O interface 205, as required.
Removable medium 211 such as a magnetic disk, an optical disk, a magneto-optical disk, a flash drive or another suitable removable medium is mounted on drive 210, so that a computer program read therefrom is installed into storage unit 208, as required.
The processes described above may be implemented as computer software programs or on a computer-readable storage medium.
Embodiments of the present disclosure include a computer program product including a computer program tangibly embodied on a machine readable medium, the computer program including program code for performing methods.
The computer program may be downloaded and mounted from the network via the communication unit 209, and/or installed from the removable medium 211, as shown in FIG. 7.
Various example embodiments of the present disclosure may be implemented in hardware or special purpose circuits (e.g., control circuitry), software, logic or any combination thereof.
The various steps of FIG. 3 can be executed by control circuitry (e.g., CPU 201 in combination with other components of FIG. 7); thus, the control circuitry may perform the actions described in this disclosure.
Some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, a processor and/or other computing device(s), which may include control circuitry. While various aspects of the example embodiments of the present disclosure are illustrated and described as block diagrams, flowcharts, or using some other pictorial representation, it will be appreciated that the blocks, apparatus, systems, techniques, or methods described herein may be implemented in, as nonlimiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
Various blocks shown in the flowcharts may be viewed as method steps, and/or as operations that result from operation of computer program code, and/or as a plurality of coupled logic circuit elements constructed to carry out the associated function(s).
Embodiments of the present disclosure include a computer program product including a computer program tangibly embodied on a machine readable medium, the computer program containing program codes configured to carry out the methods as described above.
Computer program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages.
These computer program codes may be provided to one or more processors of a general-purpose computer, special purpose computer, or other programmable data processing apparatus that has control circuitry, such that the program codes, when executed by one or more processors of the computer or other programmable data processing apparatus, cause the functions/operations specified in the flowcharts and/or block diagrams to be implemented.
The program code may execute entirely on a computer, partly on the computer, as a stand-alone software package, partly on the computer and partly on a remote computer, or entirely on the remote computer or server, or may be distributed over one or more remote computers and/or servers.
The one or more processors may operate as a standalone device or may be connected, e.g., networked, to other processor(s).
A network may be built on various different network protocols, and may be the Internet, a Wide Area Network (WAN), a Local Area Network (LAN), or any combination thereof.
The software may be distributed on computer readable media, which may comprise computer storage media (or non-transitory media) and communication media (or transitory media).
Computer storage media includes both volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data.
Computer storage media includes, but is not limited to, physical (non-transitory) storage media in various forms, such as ROM, PROM, EPROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer.
Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media.
The illustrated partitions, such as the blocks in FIG. 2, are merely illustrative logical partitions for ease of discussion; such partitions may be split into additional partitions, combined into fewer partitions, supplemented with additional partitions, or reduced by eliminating partitions, without departing from the spirit of the present invention.
The partitions of the operational steps, which may also be referred to as functions, steps, operations, processes, or acts, may be combined into fewer steps or split into additional steps, and steps may be reordered or eliminated, in whole or in part, without departing from the spirit of this disclosure.
EEE1. A method comprising: at an electronic device: obtaining a plurality of images capturing a subject (e.g., person, object) from a plurality of perspectives (e.g., multiple images including a subject and an anatomical feature of the subject (e.g., an ear), a sequence of images, a video, etc.); for at least two images of the plurality of images: performing landmark detection to detect locations (e.g., 2D coordinates) corresponding to a plurality of landmarks; estimating head pose data from the locations corresponding to the landmarks; and generating camera pose data from the head pose data; generating, from camera pose data corresponding to the at least two images of the plurality of images, relative 3D coordinate data corresponding to locations (e.g., 3D relative coordinates) of a set of anatomical features captured in the plurality of images; and
EEE2. The method of EEE 1, further comprising: generating data corresponding to a head related transfer function adapted to the subject based at least in part on the scaled 3D coordinate data.
EEE3. The method of any of EEEs 1-2, further comprising: determining a motion metric (e.g., velocity or acceleration, angular or linear) associated with the plurality of images; and in accordance with a determination that the motion metric satisfies a first criteria (e.g., is above a first threshold value; is below a first threshold value), the at least 2 images comprise a first quantity of images; and in accordance with a determination that the motion metric satisfies a second criteria (e.g., is below a second threshold value; is above a second threshold value), the at least 2 images comprise a second quantity of images larger than the first quantity of images.
In some embodiments, the first and second criteria are thresholds corresponding to a common value (e.g., an angular value related to head or camera orientation). In some embodiments, the first and second criteria are thresholds corresponding to different values.
In some embodiments, the motion metric is based on data from sensors of the electronic device (e.g., IMU, accelerometers, gyroscopes, etc.). In some embodiments, the motion metric is based on image data of the plurality of images.
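The selection logic of EEE 3 can be illustrated with a minimal sketch. The direction of the comparison (slow motion selects the smaller quantity), the threshold of 30 degrees per second, the quantities 4 and 8, and the function names are all assumptions made for illustration; the embodiments above leave these choices open.

```python
def choose_image_count(motion_metric, threshold=30.0,
                       first_quantity=4, second_quantity=8):
    """Return how many images to use for reconstruction.

    Assumption: motion_metric is an angular speed in degrees per second and
    a single common threshold separates the two criteria.
    """
    if second_quantity <= first_quantity:
        raise ValueError("second_quantity must be larger than first_quantity")
    # First criteria (below threshold) -> first quantity;
    # second criteria (at or above threshold) -> larger second quantity.
    return first_quantity if motion_metric < threshold else second_quantity


def pick_evenly_spaced(frames, count):
    """Pick `count` frames spread evenly over the captured sequence."""
    if count < 2:
        raise ValueError("at least two images are required")
    if count >= len(frames):
        return list(frames)
    step = (len(frames) - 1) / (count - 1)
    return [frames[round(i * step)] for i in range(count)]
```

For example, `pick_evenly_spaced(frames, choose_image_count(speed))` would keep more frames from a capture in which the device or head moved quickly.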
EEE4. The method of any of EEEs 1-3, further comprising: determining a frame rate associated with the plurality of images (e.g., fps of video associated with the plurality of images); and in accordance with a determination that the frame rate satisfies a third criteria (e.g., is above a third threshold value; is below a third threshold value), the at least 2 images comprise a first quantity of images; and in accordance with a determination that the frame rate satisfies a fourth criteria (e.g., is below a fourth threshold value; is above a fourth threshold value), the at least 2 images comprise a second quantity of images larger than the first quantity of images.
In some embodiments, the third and fourth criteria are thresholds corresponding to the same value.
In some embodiments, the first and second criteria are thresholds corresponding to different values.
In some embodiments, the frame rate is obtained from metadata associated with the plurality of images.
EEE5. The method of any of EEEs 1-4, wherein the at least 2 images are selected based at least in part on a determination that an estimated head pose angle of a respective image of the plurality of images is within a pre-determined head pose range (e.g., an azimuth range).
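The head pose range test of EEE 5 can be sketched as a simple filter. The per-frame dictionary layout and the `yaw_deg` key are hypothetical, as is the use of yaw as the azimuth angle.

```python
def select_frames_in_azimuth_range(frames, min_deg, max_deg):
    """Keep frames whose estimated head yaw lies inside [min_deg, max_deg].

    Assumption: each frame is a dict carrying a precomputed 'yaw_deg' value.
    """
    if min_deg > max_deg:
        raise ValueError("min_deg must not exceed max_deg")
    return [f for f in frames if min_deg <= f["yaw_deg"] <= max_deg]
```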
EEE7. The method of any of EEEs 1-6, wherein detecting a plurality of landmarks includes performing at least one of: ear detection and face detection.
In some embodiments, ear detection includes a determination of data corresponding to ear landmarks that describe the anatomic characteristics of the ear pinna, helix, anti-helix, tragi, fossa, and concha.
EEE9. The method of any of EEEs 1-8, wherein estimating head pose data includes obtaining head pose data via a request (e.g., an API call; transmitting a request for head pose data for a respective image and receiving head pose data for the respective image in response to the request).
EEE10. The method of any of EEEs 1-9, wherein estimating head pose data includes: computing an instantaneous face mesh of the subject’s face; and estimating a roto-translation (yaw, pitch, roll) that roto-translates the points of the subject’s instantaneous face mesh into a generic 3D face mesh centered in an image/frame.
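Once a head roto-translation is available, camera pose data (as recited in EEE 1) can be derived by inverting it. The following is a minimal sketch under stated assumptions: angles are in radians, yaw rotates about the Y axis, pitch about X, roll about Z, the rotation is composed as Ry·Rx·Rz, and the head pose maps head coordinates to camera coordinates as x_cam = R·x_head + t. None of these conventions is specified in the source, and the function names are hypothetical.

```python
import math


def matmul3(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def rotation_from_euler(yaw, pitch, roll):
    """Build R = Ry(yaw) * Rx(pitch) * Rz(roll) (assumed convention)."""
    cy, sy = math.cos(yaw), math.sin(yaw)
    cp, sp = math.cos(pitch), math.sin(pitch)
    cr, sr = math.cos(roll), math.sin(roll)
    ry = [[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]]
    rx = [[1.0, 0.0, 0.0], [0.0, cp, -sp], [0.0, sp, cp]]
    rz = [[cr, -sr, 0.0], [sr, cr, 0.0], [0.0, 0.0, 1.0]]
    return matmul3(matmul3(ry, rx), rz)


def camera_pose_from_head_pose(r_head, t_head):
    """Invert x_cam = R x_head + t to express the camera in the head frame.

    Returns (R^T, -R^T t): the camera orientation and the camera center,
    both in head coordinates.
    """
    r_cam = [list(row) for row in zip(*r_head)]  # transpose = inverse rotation
    center = [-sum(r_cam[i][k] * t_head[k] for k in range(3))
              for i in range(3)]
    return r_cam, center
```

With this convention, a head turned by 90 degrees of yaw and located 0.5 units in front of the camera yields a camera center on the head's X axis, which is the kind of per-image viewpoint that multi-view reconstruction needs.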
In some embodiments, landmark detection includes: computing an instantaneous face mesh of the subject’s face; and using the instantaneous face mesh to detect a face location or an ear location in a respective image of the at least two images.
EEE12. The method of any of EEEs 1-11, wherein generating relative 3D coordinate data includes: projecting corresponding detected landmarks from each of the at least two images to the 3D world using camera pose data or head pose data associated with each respective image of the at least two images; and performing multi-angulation to obtain relative 3D face landmark coordinates for each detected landmark.
EEE13. The method of EEE 12, wherein a respective 3D face landmark coordinate is identified as a point that minimizes an error in intersecting the projection lines.
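One common way to find a point that minimizes the error in intersecting several projection lines is a linear least-squares solve over the squared point-to-line distances. The sketch below assumes each projection line is already expressed as an origin (camera center) and a direction in a shared 3D frame; the function names and the choice of squared distance as the error are assumptions, not details taken from the source.

```python
import math


def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def triangulate(origins, directions):
    """Return the point closest (in least squares) to all given 3D lines.

    Solves sum_i (I - d_i d_i^T) p = sum_i (I - d_i d_i^T) o_i, where d_i is
    the unit direction and o_i the origin of line i.
    """
    a = [[0.0] * 3 for _ in range(3)]
    b = [0.0] * 3
    for o, d in zip(origins, directions):
        norm = math.sqrt(sum(x * x for x in d))
        u = [x / norm for x in d]
        for i in range(3):
            for j in range(3):
                m = (1.0 if i == j else 0.0) - u[i] * u[j]
                a[i][j] += m
                b[i] += m * o[j]
    d = det3(a)
    if abs(d) < 1e-12:
        raise ValueError("projection lines are (nearly) parallel")
    point = []
    for col in range(3):  # Cramer's rule for the 3x3 system
        m = [row[:] for row in a]
        for r in range(3):
            m[r][col] = b[r]
        point.append(det3(m) / d)
    return point
```

With noise-free lines the result is their exact intersection; with noisy detections it is the point with the smallest summed squared distance to the lines.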
EEE14. The method of any of EEEs 1-13, wherein generating scaled 3D coordinate data from the relative 3D coordinate data includes estimating a distance from the subject to the camera for at least one of: the at least two images of the plurality of images; and a reference subset of the plurality of images different from the subset formed by the at least two images of the plurality of images (e.g., a set of one or more images of the plurality of images not including the at least two images).
In some embodiments, a reference subset corresponds to one or more images of the plurality of images capturing a frontal view of the subject (e.g., images associated with a head pose within one or more central threshold angular ranges (e.g., yaw, pitch, or roll approaching 0°); one or more images of the plurality of images associated with the subject directly facing a camera that captured the respective image; or one or more images of the plurality of images of the subject facing toward a camera (e.g., normal to the surface of the camera lens of the camera capturing the image)).
In some embodiments, the reference subset of the plurality of images is associated with a specific ear of the subject (left or right) captured in each of the at least two images of the plurality of images.
EEE15. The method of EEE 14, wherein estimating a distance from the subject to the camera is based on at least one of the set of: an iris size; an inter-pupillary distance; a reference object size; data from a hardware depth sensor; and a depth from motion algorithm.
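For the size-based cues in EEE 15 (iris size, inter-pupillary distance, reference object), a pinhole camera model gives the distance directly from the known physical size and its apparent size in pixels. The sketch below is illustrative only: the focal length in pixels is assumed to be known, the population-average iris diameter of roughly 11.7 mm is an assumption rather than a value from the source, and the way the scale factor is applied to the relative coordinates is likewise an assumed, simplified scheme.

```python
AVERAGE_IRIS_DIAMETER_M = 0.0117  # assumed population average, not from source


def distance_from_known_size(focal_px, real_size_m, size_px):
    """Pinhole model: distance = focal length * real size / apparent size."""
    if size_px <= 0:
        raise ValueError("size_px must be positive")
    return focal_px * real_size_m / size_px


def scale_relative_points(points, relative_distance, metric_distance):
    """Scale relative 3D points so that a distance known in both the
    relative and the metric frame agrees (assumed scaling scheme)."""
    if relative_distance <= 0:
        raise ValueError("relative_distance must be positive")
    factor = metric_distance / relative_distance
    return [tuple(c * factor for c in p) for p in points]
```

For instance, an iris that spans 23.4 pixels under a 1000-pixel focal length would place the subject about 0.5 m from the camera under these assumptions.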
EEE16. The method of any of EEEs 14-15, wherein estimating a distance from the subject to the camera includes using Bayesian inference to combine the results from more than one distance estimation strategy (e.g., different estimation modalities).
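A simple instance of Bayesian combination, assuming each modality reports an independent Gaussian estimate (mean and standard deviation) and the prior is flat, is the precision-weighted mean. The Gaussian and independence assumptions, and the function name, are illustrative choices; the source does not specify the inference model.

```python
import math


def fuse_gaussian_estimates(estimates):
    """Fuse independent Gaussian estimates given as (mean, std) pairs.

    The posterior under a flat prior is Gaussian with precision equal to the
    sum of the individual precisions and a precision-weighted mean.
    """
    if not estimates:
        raise ValueError("at least one estimate is required")
    precision = 0.0
    weighted_sum = 0.0
    for mean, std in estimates:
        if std <= 0:
            raise ValueError("standard deviations must be positive")
        w = 1.0 / (std * std)
        precision += w
        weighted_sum += w * mean
    return weighted_sum / precision, math.sqrt(1.0 / precision)
```

The fused estimate leans toward the more certain modality, and its standard deviation is never larger than that of the best single estimate.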
EEE17. The method of any of EEEs 1-16, further comprising: estimating a head radius for each image of the at least two images by projecting 2D face mesh information onto an image plane.
EEE18. The method of EEE 17, wherein head radius is estimated based on a distance between an ear plane or axis (e.g., a plane that includes the two tragi, coronal to the head of the subject; an axis between the two tragi) and a face plane or axis (e.g., a plane parallel to the ear plane, that minimizes the distance between facial mesh landmarks; an axis connecting symmetric facial landmarks).
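The plane-to-plane distance of EEE 18 can be sketched as follows. The sketch assumes 3D landmark coordinates, takes an explicit "up" direction to define the coronal ear plane through the two tragi, and places the parallel face plane at the least-squares position, which passes through the centroid of the facial landmarks. The up vector, the function name, and the 3D input are assumptions; how the resulting distance maps to a head radius is not specified in the source.

```python
import math


def _sub(a, b):
    return [x - y for x, y in zip(a, b)]


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def ear_to_face_plane_distance(left_tragus, right_tragus, up, face_landmarks):
    """Distance between the ear plane and a parallel best-fit face plane."""
    axis = _sub(right_tragus, left_tragus)
    normal = _cross(axis, up)  # normal of the plane spanned by axis and up
    norm = math.sqrt(_dot(normal, normal))
    if norm < 1e-12:
        raise ValueError("up vector must not be parallel to the ear axis")
    normal = [x / norm for x in normal]
    n = len(face_landmarks)
    centroid = [sum(p[i] for p in face_landmarks) / n for i in range(3)]
    # A plane with fixed normal fits the landmarks best (least squares)
    # when it passes through their centroid.
    return abs(_dot(normal, _sub(centroid, left_tragus)))
```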
EEE19. The method of any of EEEs 2-18, wherein generating data corresponding to a head related transfer function includes: extracting features (e.g., related to an ear, face, and/or head) from scaled 3D coordinate data corresponding to anthropometric landmarks (e.g., distances, angles, areas); and applying the extracted features to a generative model to generate data corresponding to one or more personalized head related transfer functions (PHRTFs) adapted to the subject.
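The feature extraction step of EEE 19 can be illustrated with two basic measurements on scaled 3D landmarks: a distance between two landmarks and an angle at one landmark. Which landmarks are measured, and how the features are fed to the generative model, are not stated in the source; the function names here are hypothetical and the model itself is not sketched.

```python
import math


def landmark_distance(a, b):
    """Euclidean distance between two 3D landmarks."""
    return math.dist(a, b)


def landmark_angle_deg(a, vertex, b):
    """Angle in degrees at `vertex` formed by landmarks a and b."""
    va = [x - v for x, v in zip(a, vertex)]
    vb = [x - v for x, v in zip(b, vertex)]
    na = math.sqrt(sum(x * x for x in va))
    nb = math.sqrt(sum(x * x for x in vb))
    if na == 0 or nb == 0:
        raise ValueError("landmarks must differ from the vertex")
    cos_angle = sum(x * y for x, y in zip(va, vb)) / (na * nb)
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))
```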
EEE20. The method of any of EEEs 1-19, further comprising: calculating a 2D to 3D projection error; and in accordance with a determination that the projection error meets first error criteria (e.g., including a determination that the error value is greater than a first threshold), initiating a process to capture additional image data of the subject; and using the additional image data to generate scaled 3D coordinate data (e.g., according to the method of EEE 1).
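One plausible reading of the projection error in EEE 20 is a reprojection error: the reconstructed 3D landmarks are projected back into an image and compared with the detected 2D landmarks. The sketch below takes that reading as an assumption, uses a pinhole camera with 3D points already expressed in camera coordinates, and uses a hypothetical pixel threshold to decide whether to request additional capture.

```python
import math


def project_pinhole(point, focal_px, cx, cy):
    """Project a 3D point in camera coordinates to pixel coordinates."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must lie in front of the camera")
    return (focal_px * x / z + cx, focal_px * y / z + cy)


def mean_reprojection_error(points_3d, observed_2d, focal_px, cx, cy):
    """Mean pixel distance between projected and detected landmarks."""
    if not points_3d or len(points_3d) != len(observed_2d):
        raise ValueError("need matching, non-empty landmark lists")
    total = 0.0
    for p, (u_obs, v_obs) in zip(points_3d, observed_2d):
        u, v = project_pinhole(p, focal_px, cx, cy)
        total += math.hypot(u - u_obs, v - v_obs)
    return total / len(points_3d)


def needs_additional_capture(error_px, first_threshold_px=2.0):
    """First error criteria (assumed): error greater than the threshold."""
    return error_px > first_threshold_px
```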
EEE21. The method of any of EEEs 1-19, further comprising: calculating a 2D to 3D projection error; and in accordance with a determination that the projection error meets a second error criteria (e.g., including a determination that the error value is greater than a second threshold and/or a determination that a process to capture additional image data of the subject has previously been initiated), using image data associated with a first ear (e.g., left ear, right ear) to generate scaled 3D coordinate data corresponding to a second ear (e.g., right ear, left ear).
EEE22. A non-transitory computer-readable storage medium including instructions, which when executed by one or more processors of an electronic device, cause the electronic device to perform the methods of any of EEEs 1-21.
EEE23. An electronic device including one or more processors and a memory storing instructions, which when executed by the one or more processors, cause the electronic device to perform the methods of any of EEEs 1-21.
EEE24. The electronic device according to EEE 23, the electronic device further comprising one or more cameras, and wherein the plurality of images are captured by the one or more cameras.

Landscapes

Engineering & Computer Science (AREA)
Computer Vision & Pattern Recognition (AREA)
Physics & Mathematics (AREA)
General Physics & Mathematics (AREA)
Theoretical Computer Science (AREA)
Image Analysis (AREA)
Image Processing (AREA)

EP24717865.0A 2023-03-10 2024-03-07 Image based reconstruction of 3d landmarks for use in generation of personalized head-related transfer functions Pending EP4677546A1 (de)

Applications Claiming Priority (5)

Application Number	Priority Date	Filing Date	Title
CN2023080861		2023-03-10
CN2023082495		2023-03-20
US202363499759P	2023-05-03	2023-05-03
US202363507498P	2023-06-12	2023-06-12
PCT/US2024/018811 WO2024191729A1 (en)	2023-03-10	2024-03-07	Image based reconstruction of 3d landmarks for use in generation of personalized head-related transfer functions

Publications (1)

Publication Number	Publication Date
EP4677546A1 (de)	2026-01-14

Family

ID=90720120

Family Applications (1)

Application Number	Priority Date	Filing Date	Title
EP24717865.0A Pending EP4677546A1 (de)	2023-03-10	2024-03-07	Image based reconstruction of 3d landmarks for use in generation of personalized head-related transfer functions

Country Status (5)

Country	Link
EP (1)	EP4677546A1 (de)
JP (1)	JP2026509208A (de)
KR (1)	KR20250161550A (de)
CN (1)	CN120813973A (de)
WO (1)	WO2024191729A1 (de)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number	Priority date	Publication date	Assignee	Title
CN120846281B (zh) *	2025-09-17	2025-11-28	歌尔股份有限公司	头戴设备佩戴姿态角测量方法、装置、电子设备及系统

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number	Priority date	Publication date	Assignee	Title
JP7442494B2 (ja)	2018-07-25	2024-03-04	ドルビーラボラトリーズライセンシングコーポレイション	光学式捕捉によるパーソナライズされたｈｒｔｆ

2024
- 2024-03-07 CN CN202480018222.8A patent/CN120813973A/zh active Pending
- 2024-03-07 KR KR1020257029800A patent/KR20250161550A/ko active Pending
- 2024-03-07 EP EP24717865.0A patent/EP4677546A1/de active Pending
- 2024-03-07 JP JP2025550548A patent/JP2026509208A/ja active Pending
- 2024-03-07 WO PCT/US2024/018811 patent/WO2024191729A1/en not_active Ceased

Also Published As

Publication number	Publication date
KR20250161550A (ko)	2025-11-17
WO2024191729A1 (en)	2024-09-19
JP2026509208A (ja)	2026-03-17
CN120813973A (zh)	2025-10-17

Legal Events

Date	Code	Title	Description
2024-04-19	STAA	Information on the status of an ep patent application or granted ep patent	Free format text: STATUS: UNKNOWN
2024-09-21	STAA	Information on the status of an ep patent application or granted ep patent	Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE
2025-12-12	PUAI	Public reference made under article 153(3) epc to a published international application that has entered the european phase	Free format text: ORIGINAL CODE: 0009012
2025-12-12	STAA	Information on the status of an ep patent application or granted ep patent	Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE
2026-01-14	17P	Request for examination filed	Effective date: 20250911
2026-01-14	AK	Designated contracting states	Kind code of ref document: A1 Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC ME MK MT NL NO PL PT RO RS SE SI SK SM TR
2026-03-18	P01	Opt-out of the competence of the unified patent court (upc) registered	Free format text: CASE NUMBER: UPC_APP_0004280_4677546/2026 Effective date: 20260206

Publication	Publication Date	Title
JP7785824B2 (ja)	2025-12-15	光学式捕捉によるパーソナライズされたｈｒｔｆ
US12505644B2 (en)	2025-12-23	Method for generating customized/personalized head related transfer function
EP3674852B1 (de)	2024-01-24	Verfahren und vorrichtung mit blickschätzung
US11776307B2 (en)	2023-10-03	Arrangement for generating head related transfer function filters
US20170366896A1 (en)	2017-12-21	Associating Audio with Three-Dimensional Objects in Videos
US9813693B1 (en)	2017-11-07	Accounting for perspective effects in images
CN110276317A (zh)	2019-09-24	一种物体尺寸检测方法、物体尺寸检测装置及移动终端
CN109997175A (zh)	2019-07-09	确定虚拟对象的大小
EP3757878B1 (de)	2025-07-30	Kopfhaltungsschätzung
US9483836B2 (en)	2016-11-01	Method and apparatus for real-time conversion of 2-dimensional content to 3-dimensional content
WO2024191729A1 (en)	2024-09-19	Image based reconstruction of 3d landmarks for use in generation of personalized head-related transfer functions
KR20160062665A (ko)	2016-06-02	동작 인식 장치 및 방법
US12225180B2 (en)	2025-02-11	Method and apparatus for generating stereoscopic display contents
US20250191219A1 (en)	2025-06-12	Landmark selection for ear tracking
HK40098379A (zh)	2024-04-05	经由光学捕获的个性化hrtfs
CN120692403A (zh)	2025-09-23	视频成帧
WO2024232882A1 (en)	2024-11-14	Systems and methods for multi-view depth estimation using simultaneous localization and mapping
WO2022216295A1 (en)	2022-10-13	Method and apparatus for operating an image signal processor