EP4367648A1 - Multi-view multi-target action recognition - Google Patents

Multi-view multi-target action recognition

Info

Publication number: EP4367648A1
Authority: EP; European Patent Office
Prior art keywords: target subject; actions; implementations; videos; processors
Prior art date: 2021-08-10
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.): Pending

Application number

EP22751825.5A

Inventor

Wanxin Xu

Ko-Kai Albert HUANG

Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)

Sony Group Corp

Sony Corp of America

Original Assignee

Sony Group Corp

Sony Corp of America

Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)

2021-08-10

Filing date

2022-07-20

Publication date

2024-05-15

2021-12-22 Priority claimed from US17/559,751 (US12299929B2)

2022-07-20 Application filed by Sony Group Corp, Sony Corp of America

2024-05-15 Publication of EP4367648A1

Status: Pending

Classifications

- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/52—Surveillance or monitoring of activities, e.g. for recognising suspicious objects
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/60—Type of objects
- G06V20/64—Three-dimensional [3D] objects
- G06V20/647—Three-dimensional [3D] objects by matching two-dimensional images to three-dimensional objects
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition

Definitions

Action recognition has become an active research area and such research continues to rapidly advance.
Some camera systems are able to capture videos of a person, analyze movements of the person, and generate an image or video dataset of metadata.
To identify human actions captured by the camera videos of the system, a person needs to manually view the videos.
Manual monitoring and event reporting can be unreliable and time-consuming, especially where the positions and angles of the video cameras may vary and might not provide adequate coverage.
Multiple cameras may be used in a controlled environment. However, subjects, movements, and background variation may still be substantially limited. Also, understanding the pose information of multiple people in complex environments remains a challenge.
A system includes one or more processors, and includes logic encoded in one or more non-transitory computer-readable storage media for execution by the one or more processors.
The logic is operable to cause the one or more processors to perform operations including: obtaining a plurality of videos of a plurality of subjects in an environment, where at least one target subject of the plurality of subjects performs one or more actions in the environment; tracking the at least one target subject across at least two cameras; reconstructing a 3-dimensional (3D) model of the at least one target subject based on the plurality of videos and the tracking of the at least one target subject; and recognizing the one or more actions of the at least one target subject based on the reconstructing of the 3D model.
The plurality of videos that are obtained are 2-dimensional (2D) videos.
The logic when executed is further operable to cause the one or more processors to perform operations including determining one or more key points for the at least one target subject.
The logic when executed is further operable to cause the one or more processors to perform operations including determining pose information associated with the at least one target subject.
The logic when executed is further operable to cause the one or more processors to perform operations including reconstructing the 3D model based on pose information.
The logic when executed is further operable to cause the one or more processors to perform operations including: determining pose information associated with the at least one target subject; and recognizing the one or more actions of the at least one target subject based on the determining of the pose information.
The logic when executed is further operable to cause the one or more processors to perform operations including distinguishing between different actions of a plurality of actions of the at least one target subject based on the reconstructing of the 3D model.
A non-transitory computer-readable storage medium with program instructions thereon is provided. When executed by one or more processors, the instructions are operable to cause the one or more processors to perform operations including: obtaining a plurality of videos of a plurality of subjects in an environment, where at least one target subject of the plurality of subjects performs one or more actions in the environment; tracking the at least one target subject across at least two cameras; reconstructing a 3-dimensional (3D) model of the at least one target subject based on the plurality of videos and the tracking of the at least one target subject; and recognizing the one or more actions of the at least one target subject based on the reconstructing of the 3D model.
The plurality of videos that are obtained are 2-dimensional (2D) videos.
The instructions when executed are further operable to cause the one or more processors to perform operations including determining one or more key points for the at least one target subject.
The instructions when executed are further operable to cause the one or more processors to perform operations including determining pose information associated with the at least one target subject.
The instructions when executed are further operable to cause the one or more processors to perform operations including reconstructing the 3D model based on pose information.
The instructions when executed are further operable to cause the one or more processors to perform operations including: determining pose information associated with the at least one target subject; and recognizing the one or more actions of the at least one target subject based on the determining of the pose information.
The instructions when executed are further operable to cause the one or more processors to perform operations including distinguishing between different actions of a plurality of actions of the at least one target subject based on the reconstructing of the 3D model.
A method includes: obtaining a plurality of videos of a plurality of subjects in an environment, where at least one target subject of the plurality of subjects performs one or more actions in the environment; tracking the at least one target subject across at least two cameras; reconstructing a 3-dimensional (3D) model of the at least one target subject based on the plurality of videos and the tracking of the at least one target subject; and recognizing the one or more actions of the at least one target subject based on the reconstructing of the 3D model.
The plurality of videos that are obtained are 2-dimensional (2D) videos.
The method further includes determining one or more key points for the at least one target subject.
The method further includes determining pose information associated with the at least one target subject.
The method further includes reconstructing the 3D model based on pose information.
The method further includes determining pose information associated with the at least one target subject; and recognizing the one or more actions of the at least one target subject based on the determining of the pose information.
FIG. 1 is a block diagram of an example environment 100 for recognizing actions of multiple people using multiple cameras, which may be used for implementations described herein.
FIG. 2 is an example flow diagram for recognizing actions of multiple people using multiple cameras, according to some implementations.
FIG. 3 is an example flow diagram for reconstructing a multi-view pose, according to some implementations.
FIG. 4 is a block diagram of an example environment for recognizing clinical activity using multiple cameras and an overlap region, which may be used for implementations described herein.
FIG. 5 is a block diagram of an example environment for recognizing clinical activity, which may be used for implementations described herein.
FIG. 6 is an example flow diagram for determining a multi-view pose, according to some implementations.
FIG. 7 is an example flow diagram for providing a reconstructed pose, according to some implementations.
FIG. 8 is an example flow diagram for recognizing actions of a target subject, according to some implementations.
FIG. 9 is a block diagram of an example network environment, which may be used for some implementations described herein.
FIG. 10 is a block diagram of an example computer system, which may be used for some implementations described herein.
Implementations described herein enable, facilitate, and manage robust multi-view multi-target action recognition using reconstructed 3-dimensional (3D) poses. As described in more detail herein, implementations recognize multi-camera multi-target actions by utilizing information of reconstructed 3D poses as prior knowledge along with a skeleton-based neural network. Implementations described herein achieve higher performance than deep learning methods in complex environments. Implementations described herein differentiate actions of similar movement patterns and are also more flexible and scalable than existing deep learning techniques without requiring significant additional data for training.
Implementations have various potential application areas. Such areas may include, for example, behavior understanding in medical or sports fields. Applications may vary, depending on the particular implementation. Other example application areas may include human-computer interaction, surveillance and security, retail industries, manufacturing industries, etc.
A system obtains multiple videos of multiple subjects in an environment, where at least one target subject of the multiple subjects performs one or more actions in the environment.
The system further tracks the at least one target subject across at least two cameras.
The system further reconstructs a 3D model of the at least one target subject based on the videos and the tracking of the at least one target subject.
The system further recognizes the one or more actions of the at least one target subject based on the reconstructing of the 3D model.
FIG. 1 is a block diagram of an example environment 100 for recognizing actions of multiple people using multiple cameras, which may be used for implementations described herein.
System 102 is a context-aware system that provides robust recognition of actions of multiple people using multiple cameras.
Environment 100 includes a system 102, which communicates with a client 104 via a network 106.
Network 106 may be any suitable communication network such as a Wi-Fi network, Bluetooth network, the Internet, etc.
Environment 100 may be any environment where activity involving multiple subjects (e.g., multiple people and/or multiple objects, etc.) is recognized, monitored, and tracked by system 102.
Environment 100 may be any setting including work settings and public settings.
Environment 100 may be a retail store, a clinical setting, a public park, etc.
System 102, client 104, and network 106 may be local to environment 100, remote to environment 100 (e.g., in the cloud), or a combination thereof.
Shown is an activity area 108, which may be an indoor area or outdoor area in environment 100.
Activity area 108 may include indoor and outdoor portions.
The configuration of activity area 108 may vary, depending on the particular implementation.
A portion of activity area 108 may include an indoor seating area of a restaurant and may include an outdoor patio seating area of the restaurant.
Also shown are people or subjects 110, 112, and 114. While example subjects are described in the context of people, subjects may also include inanimate objects, all of which are captured by multiple video cameras 120, 122, 124, and 126.
The videos are captured by multiple video cameras.
System 102 monitors the activity of subjects or people 110, 112, 114, etc. in an activity area 108 using physical video cameras 120, 122, 124, 126, which capture video of people 110, 112, 114 at different angles or viewpoints.
System 102 identifies at least one target subject from the multiple subjects. While various implementations are described in the context of a single target subject, these implementations also apply to each of multiple target subjects. As such, the system tracks one or more target subjects, reconstructs one or more 3D models of the target subjects, and recognizes actions of the one or more target subjects. Various example implementations directed to these aspects are described in more detail herein. In various implementations, each of subjects 110, 112, 114, etc. may represent one or more people. Also, implementations and references to a particular target subject may apply to any and all target subjects. The number of target subjects may vary, depending on the particular implementation.
Subjects 110, 112, 114 may represent one or more of clinicians such as a doctor and nurse, one or more assistants, a patient, etc.
In addition to subjects 110, 112, and 114, there may also be one or more inanimate objects (not shown) that the system may track.
Objects may include one or more hospital beds, surgery equipment, surgery tools, etc.
The particular type of object may vary and will depend on the particular implementation.
A given subject may also be referred to as a subject, a person, a target subject, an object, or an inanimate object.
The system utilizes vision-based approaches, which are efficient in that there is no need for subjects to have any wearable equipment. Vision-based approaches are also highly scalable to different settings of the system.
The system automatically and accurately recognizes activity in a clinical environment (e.g., operating room, emergency room, etc.), which enables understanding of surgical or clinical workflow that is critical for optimizing clinical activities.
The system performs real-time monitoring of staff and patient activities in an environment in order to enhance patient outcomes and care with reduced staff costs.
FIG. 1 shows one block for each of system 102, client 104, network 106, and activity area 108. Blocks 102, 104, 106, and 108 may represent multiple systems, client devices, networks, and activity areas. Also, there may be any number of subjects in a given activity area.
Environment 100 may not have all of the components shown and/or may have other elements including other types of elements instead of, or in addition to, those shown herein.
While system 102 performs implementations described herein, in other implementations, any suitable component or combination of components associated with system 102 or any suitable processor or processors associated with system 102 may facilitate performing the implementations described herein.
FIG. 2 is an example flow diagram for recognizing actions of multiple people using multiple cameras, according to some implementations.
A method is initiated at block 202, where a system such as system 102 receives or obtains multiple videos of multiple subjects in activity area 108 of environment 100.
The multiple subjects captured in the videos include at least one target subject to be tracked, where the target subject performs at least one action in environment 100.
The cameras record the videos, and may store the videos in any suitable storage location.
Video sequences are captured from multiple cameras, where the cameras may be configured with predetermined (including pre-calibrated) camera parameters.
Such camera parameters may include one or more intrinsic matrices, one or more extrinsic matrices, etc. While various example implementations are described in the context of the target subject, these implementations may also apply to one or more or all of the other subjects captured in the videos. In other words, there may be multiple target subjects tracked, where the system recognizes actions of each target subject being tracked.
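As an illustration of how intrinsic and extrinsic matrices are typically used, the following minimal sketch projects a 3D world point into pixel coordinates with a standard pinhole camera model. The function name `project_point` and all numeric values are hypothetical and are not taken from the source; lens distortion is ignored.

```python
def project_point(K, R, t, X):
    """Project a 3D world point X into pixel coordinates (pinhole model).

    K is a 3x3 intrinsic matrix; R (3x3 rotation) and t (3-vector) are the
    extrinsic parameters. All inputs are nested lists or tuples.
    """
    # World -> camera coordinates: Xc = R @ X + t
    Xc = [sum(R[i][j] * X[j] for j in range(3)) + t[i] for i in range(3)]
    if Xc[2] <= 0:
        raise ValueError("point is behind the camera")
    # Camera -> homogeneous image coordinates: x = K @ Xc
    x = [sum(K[i][j] * Xc[j] for j in range(3)) for i in range(3)]
    # Perspective division gives pixel coordinates (u, v)
    return (x[0] / x[2], x[1] / x[2])


# Illustrative values only: focal length 1000 px, principal point (640, 360),
# camera placed at the world origin with no rotation.
K = [[1000.0, 0.0, 640.0],
     [0.0, 1000.0, 360.0],
     [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.0, 0.0, 0.0]
pixel = project_point(K, R, t, [0.5, -0.25, 5.0])
```

With pre-calibrated parameters of this kind for every camera, the same 3D key point can be related to its 2D location in each view.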
System 102 tracks the at least one target subject across at least two cameras (e.g., video cameras 120, 122, 124, 126, etc.).
The number of cameras and their positions relative to the target subject(s) may vary, depending on the particular implementation.
The videos that are obtained are 2-dimensional (2D) videos.
The system avoids cross-view association ambiguity by processing 2D video information from multiple cameras. Noisy and incomplete 2D poses resulting from occlusions may complicate the associations of a given pose from different cameras, which may further influence the reconstruction of the pose in 3D space.
The system may track each individual object from camera to camera without losing sight of the object.
The system determines one or more key points for one or more of the subjects that the system tracks via the video cameras including the target subject.
The system also determines and/or estimates pose information associated with one or more of the objects or subjects including the target subject.
The system may perform 2D pose estimations using any suitable pose estimator and pre-calibrated cameras.
The system also determines pose information based on the respective key points associated with each object or subject tracked.
The system determines pose information associated with the at least one target subject based on triangulation. Further implementations directed to key points, pose information, and triangulation are described in more detail herein.
System 102 reconstructs a 3-dimensional (3D) model of the target subject based on the videos and the tracking of the target subject.
The system reconstructs the 3D model of the object or target subject based on the videos, where the videos are 2D videos.
The system determines pose information associated with the target subject.
The system reconstructs the 3D model based on the pose information.
System 102 recognizes the one or more actions of the target subject based on the reconstructing of the 3D model.
System 102 determines or estimates pose information associated with actions of the target subject.
The system then recognizes the one or more actions of the target subject based on the pose information associated with the at least one target subject and in association with the reconstructing of the 3D model.
The system distinguishes between different actions of the target subject based on the reconstructing of the 3D model, including the pose determinations or pose estimations.
The system recognizes the actions of multiple subjects utilizing a set of pre-calibrated cameras efficiently and robustly.
The pre-calibrated cameras may include cameras 120, 122, 124, and 126, for example.
FIG. 3 through FIG. 7 and associated descriptions involve various aspects directed to the reconstructing of the 3D model.
FIG. 8 and associated descriptions involve various aspects directed to the recognition of actions of the target subject.
FIG. 3 is an example flow diagram for reconstructing a multi-view pose, according to some implementations. The following details describe pose reconstruction and a tracking framework, according to some implementations. Referring to both FIGS.
A method is initiated at block 302, where a system such as system 102 obtains camera parameters.
The camera parameters may include an intrinsic matrix and an extrinsic matrix for each camera in the system, depending on the setting of the environment.
System 102 computes two-dimensional (2D) pose information.
The system may utilize a general key point estimator and use either a top-down or bottom-up approach.
System 102 matches 2D poses.
The pose matching maintains and tracks the identity of each target subject captured on video consistent across multiple cameras.
The system may apply one or more metrics for matching.
Example metrics may include epipolar constraints, a Euclidean distance and algorithm for data association, a Hungarian algorithm, etc.
The system may associate the 2D poses of the same person across different camera views by using geometric and cycle-consistent constraints, etc. As such, if a person leaves the field of view of one camera, the same person will be captured in the field of view of another camera in the same environment.
The system may track the movement and pose of a person based on detection and knowledge of portions of the person such as joints of limbs, height, joint and limb positions, trajectory of the person, etc.
Implementations described herein reduce computations by using the pose tracking information in 3D space.
System 102 obtains back-projected 2D pose information.
The system may obtain back-projected 2D pose information by projecting 3D pose information from block 310 (described below) to an image plane.
Tracking information from 3D space provides guidance to the current frame for pose matching at block 306.
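One way to picture this guidance is as a data-association problem within a single camera view: each back-projected pose of a tracked subject is paired with the detected 2D pose that lies closest to it. The sketch below uses the mean Euclidean distance between corresponding key points as the cost and, for brevity, an exhaustive search over assignments in place of a Hungarian algorithm (the two give the same optimum, but exhaustive search is only practical for a handful of subjects). The names `pose_distance`, `match_poses`, and `max_cost` are hypothetical.

```python
from itertools import permutations
from math import hypot


def pose_distance(pose_a, pose_b):
    """Mean Euclidean distance between corresponding 2D key points."""
    dists = [hypot(ax - bx, ay - by)
             for (ax, ay), (bx, by) in zip(pose_a, pose_b)]
    return sum(dists) / len(dists)


def match_poses(projected, detected, max_cost=50.0):
    """Pair back-projected poses (keyed by track id) with detected poses.

    Assumes there are at least as many detections as tracked subjects;
    otherwise an empty mapping is returned. Pairs whose distance exceeds
    max_cost (in pixels) are dropped from the result.
    """
    track_ids = list(projected)
    best, best_cost = None, float("inf")
    for perm in permutations(range(len(detected)), len(track_ids)):
        cost = sum(pose_distance(projected[tid], detected[j])
                   for tid, j in zip(track_ids, perm))
        if cost < best_cost:
            best, best_cost = perm, cost
    if best is None:
        return {}
    return {tid: j for tid, j in zip(track_ids, best)
            if pose_distance(projected[tid], detected[j]) <= max_cost}


# Two tracked subjects with two key points each (pixel coordinates).
projected = {"A": [(100, 100), (100, 200)], "B": [(400, 100), (400, 200)]}
detected = [[(402, 101), (399, 203)], [(98, 102), (101, 198)]]
matches = match_poses(projected, detected)
```

In this example subject "A" is matched to the second detection and subject "B" to the first, because those pairings minimize the total key-point distance.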
System 102 reconstructs a 3D pose.
The system determines the 3D location of a pose based on multiple 2D corresponding poses and triangulation. Implementations directed to triangulation are described in more detail herein in connection with FIG. 7, for example.
FIG. 4 is a block diagram of an example environment 400 for recognizing clinical activity using multiple cameras and an overlap region, which may be used for implementations described herein.
Environment 400 includes cameras 402, 404, and 406.
Cameras 402-406 may be positioned at different locations.
Cameras 402-406 may be positioned at different locations such that their fields of view overlap. As shown, the fields of view of cameras 402, 404, and 406 overlap at overlap region 408. When a given subject or subjects (e.g., staff, patient, etc.) is positioned in overlap region 408, each of cameras 402, 404, and 406 is able to capture footage of the given subject or subjects.
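The idea of an overlap region can be sketched in a simplified top-down (2D floor plan) model, where each camera has a position, a heading, a half field-of-view angle, and a maximum range. A point belongs to the overlap region when every camera sees it. The camera layout, the dictionary keys, and the function names below are illustrative assumptions, not values from the source.

```python
from math import atan2, hypot, pi, radians


def sees(camera, point):
    """Return True if a floor-plan point lies inside the camera's view wedge."""
    cx, cy = camera["pos"]
    dx, dy = point[0] - cx, point[1] - cy
    if hypot(dx, dy) > camera["range"]:
        return False
    # Smallest signed angle between the camera heading and the bearing to the point
    diff = (atan2(dy, dx) - camera["heading"] + pi) % (2 * pi) - pi
    return abs(diff) <= camera["half_fov"]


def in_overlap(cameras, point):
    """A point is in the overlap region if all cameras see it."""
    return all(sees(cam, point) for cam in cameras)


# Three hypothetical cameras aimed at the middle of a 6 m x 6 m area.
cameras = [
    {"pos": (0.0, 0.0), "heading": radians(45), "half_fov": radians(30), "range": 10.0},
    {"pos": (6.0, 0.0), "heading": radians(135), "half_fov": radians(30), "range": 10.0},
    {"pos": (3.0, 6.0), "heading": radians(-90), "half_fov": radians(30), "range": 10.0},
]
center_visible = in_overlap(cameras, (3.0, 3.0))
edge_visible = in_overlap(cameras, (0.5, 5.0))
```

A check like this can help when placing cameras so that the region where subjects are expected to act is covered by several views at once.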
Cameras 402-406 are set up and pre-calibrated to avoid occlusion and to enable 3D reconstruction of subjects in the environment.
The subjects used for calibration are visible to all of the cameras simultaneously. While 3 cameras are shown, there may be any number of cameras in environment 400. The particular number of cameras may depend on the particular environment.
The system uses cameras 402-406 to monitor subjects such as tiles on the floor in order to calibrate patterns in the environment.
Alternative camera calibration methods may be used including a commonly used checkerboard pattern or using red-green-blue-depth (RGB-D) cameras.
FIG. 5 is a block diagram of an example environment 500 for recognizing clinical activity, which may be used for implementations described herein. Shown are cameras 502 and 504, which capture video footage of subjects 506 and 508. Subjects 506 and 508 may be, for example, staff members in an operating room, or a staff member and a patient in the operating room, etc.
The system performs data fusion and clinical action recognition, including skeleton-based activity recognition.
Data fusion is a process that associates or fuses the pose of a person from one camera to the pose of the same person from other cameras. After data fusion, the system reconstructs the 3D poses of all subjects (e.g., staff, patient, etc.) in a virtual 3D space, given multiple 2D corresponding poses.
FIG. 6 is an example flow diagram for determining a multi-view pose, according to some implementations. Referring to both FIGS. 1 and 6, a method is initiated at block 602, where a system such as system 102 obtains back-projected 2D pose information.
System 102 obtains estimated poses.
The system collects estimated poses for each subject detected by the cameras.
System 102 finds corresponding poses.
Corresponding poses may include different poses of the same subject (e.g., person) captured by different cameras.
System 102 matches poses. For example, the system matches the poses from the same subject (e.g., person) from the different cameras. In some implementations, the system performs the pose matching step if the pose fails to be matched to any existing tracklets.
A tracklet may be defined as a fragment of a track followed by a moving subject, as constructed by an image recognition system.
The system may apply one or more metrics for matching.
Example metrics may include epipolar constraints, a Euclidean distance and algorithm for data association, a Hungarian algorithm, etc.
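The tracklet check that precedes pose matching can be sketched as follows. A new pose, summarized here by a single 3D reference point, is appended to the nearest existing tracklet if it lies within a distance threshold; otherwise the function returns `None`, signalling that the pose matching step is needed. The data layout, the function name `assign_to_tracklet`, and the `max_jump` threshold are assumptions made for illustration.

```python
from math import dist


def assign_to_tracklet(tracklets, position, t, max_jump=0.5):
    """Append a pose position to the closest tracklet, if one is close enough.

    tracklets maps a tracklet id to a list of (time, position) samples.
    Returns the id of the tracklet the pose joined, or None when no tracklet
    ends within max_jump (same units as the positions) of the new position.
    """
    best_id, best_d = None, max_jump
    for tid, samples in tracklets.items():
        d = dist(samples[-1][1], position)  # distance to the latest sample
        if d <= best_d:
            best_id, best_d = tid, d
    if best_id is not None:
        tracklets[best_id].append((t, position))
    return best_id


# Two existing tracklets, each with one sample at time 0.
tracklets = {
    0: [(0, (0.0, 0.0, 1.0))],
    1: [(0, (3.0, 0.0, 1.0))],
}
joined = assign_to_tracklet(tracklets, (0.1, 0.0, 1.0), t=1)
unmatched = assign_to_tracklet(tracklets, (10.0, 10.0, 1.0), t=1)
```

Here the first pose extends tracklet 0, while the second is too far from every tracklet and would be passed on to the matching step.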
System 102 provides match results.
The match results indicate all of the poses of each particular subject (e.g., person).
FIG. 7 is an example flow diagram for providing a reconstructed pose, according to some implementations. Referring to both FIGS. 1 and 7, a method is initiated at block 702, where a system such as system 102 matches 2D poses.
System 102 selects multiple pairs of views from the 2D poses.
The system obtains each pair from a different camera.
The selection of the multiple pairs of views may be based on two conditions.
The first condition may be to select pairs of views based on a re-projection error being below a predetermined threshold.
The second condition may be to select pairs of views based on a confidence score being greater than a predetermined threshold. For example, a higher confidence score may be associated with less occlusion, and a lower confidence score may be associated with more occlusion.
The selection may be achieved by minimizing the re-projection error and by maximizing the confidence score for accurate 3D reconstruction.
The method follows two series of steps to provide the reconstructed pose.
The first series is associated with blocks 706, 708, and 710.
The system performs these steps if the set of pairs of views is not empty.
The second series is associated with blocks 712, 714, and 716. The system performs these steps if no pairs of views are chosen.
System 102 selects two views.
The system selects two views with a maximum-rank confidence score and a minimum-rank re-projection error.
The system may use the two views to perform triangulation for 3D pose reconstruction, as described below in connection with block 708.
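The two selection conditions and the final choice of two views can be sketched as a filter followed by a ranking. The thresholds, the dictionary keys, and the tie-breaking rule (highest confidence first, then lowest re-projection error) are assumptions for illustration; the source does not specify how the two ranks are combined.

```python
def select_view_pair(pairs, max_error=5.0, min_confidence=0.6):
    """Pick one pair of views for triangulation, or None if none qualifies.

    Each entry of pairs is a dict with keys 'views', 'reproj_error' (pixels)
    and 'confidence' (0..1). A pair qualifies when its re-projection error is
    below max_error and its confidence is above min_confidence.
    """
    candidates = [p for p in pairs
                  if p["reproj_error"] < max_error
                  and p["confidence"] > min_confidence]
    if not candidates:
        # Corresponds to the case where no pairs of views are chosen
        return None
    # Assumed ranking: highest confidence, ties broken by lowest error
    return max(candidates, key=lambda p: (p["confidence"], -p["reproj_error"]))


pairs = [
    {"views": ("cam120", "cam122"), "reproj_error": 2.0, "confidence": 0.90},
    {"views": ("cam120", "cam124"), "reproj_error": 8.0, "confidence": 0.95},
    {"views": ("cam122", "cam124"), "reproj_error": 1.5, "confidence": 0.70},
]
best = select_view_pair(pairs)
```

In this example the second pair is rejected for its high re-projection error despite its high confidence, and the first pair is selected.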
System 102 performs triangulation.
The system may utilize adaptive triangulation. Triangulation may be used to obtain 3D pose information based on given 2D matched poses in the multi-view framework.
The system may adaptively select a subset of camera views for 3D pose reconstruction instead of performing reconstruction over all cameras. For example, to minimize computation, the system may determine the cameras that capture a given target subject. Other cameras that do not capture the given subject are not needed and thus not used to collect information for that particular subject. Using only the cameras that capture the subject ensures that the system performs sufficient yet not excessive computations.
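A minimal sketch of two-view triangulation restricted to the cameras that actually see the subject is shown below. It uses the midpoint method: each 2D key point is assumed to have already been converted, via the calibrated camera parameters, into a viewing ray (camera center plus direction), and the 3D point is taken as the midpoint of the shortest segment between the two rays. This is one common triangulation technique and is not necessarily the one used in the source; all names are hypothetical.

```python
def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def triangulate_midpoint(c1, d1, c2, d2):
    """Midpoint triangulation of two viewing rays.

    Each ray starts at a camera center c and points along direction d.
    Returns (point, gap), where gap is the distance between the two rays at
    their closest approach (small gap = consistent observations).
    """
    w = [a - b for a, b in zip(c1, c2)]
    a, b, c = _dot(d1, d1), _dot(d1, d2), _dot(d2, d2)
    d, e = _dot(d1, w), _dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("rays are parallel; choose a different pair of views")
    s = (b * e - c * d) / denom  # parameter along ray 1
    t = (a * e - b * d) / denom  # parameter along ray 2
    p1 = [ci + s * di for ci, di in zip(c1, d1)]
    p2 = [ci + t * di for ci, di in zip(c2, d2)]
    point = [(u + v) / 2 for u, v in zip(p1, p2)]
    gap = sum((u - v) ** 2 for u, v in zip(p1, p2)) ** 0.5
    return point, gap


def triangulate_visible(observations):
    """Use only cameras that see the subject (entries that are not None)."""
    visible = [obs for obs in observations.values() if obs is not None]
    if len(visible) < 2:
        raise ValueError("need at least two views of the subject")
    (c1, d1), (c2, d2) = visible[0], visible[1]
    return triangulate_midpoint(c1, d1, c2, d2)


# The middle camera does not capture the subject and is skipped.
observations = {
    "cam120": ((0.0, 0.0, 0.0), (1.0, 1.0, 4.0)),
    "cam122": None,
    "cam124": ((2.0, 0.0, 0.0), (-1.0, 1.0, 4.0)),
}
point, gap = triangulate_visible(observations)
```

In practice the same routine would be applied per key point, and the returned gap could serve as a rough consistency check on the matched 2D poses.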
System 102 provides a reconstructed pose.
The system determines the 3D location of each pose of the same subject (e.g., clinician, patient, etc.) based on multiple 2D corresponding poses and triangulation.
The system determines the poses from the video feed of the multiple cameras in order to reconstruct a 3D pose of each subject.
The second series is associated with blocks 712, 714, and 716.
The system performs these steps if no pairs of views are chosen.
System 102 performs triangulation.
System 102 performs triangulation similarly to step 708 described above.
System 102 merges poses together.
The system aggregates the poses of each subject (e.g., clinician, patient, etc.) from different viewpoints of the different cameras capturing each subject.
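The source does not state how the poses are aggregated. One plausible approach, sketched below purely as an assumption, is a confidence-weighted average of the per-joint 3D estimates obtained from different view pairs. The function name `merge_poses` and the weighting scheme are hypothetical.

```python
def merge_poses(estimates):
    """Merge several 3D pose estimates of one subject into a single pose.

    estimates is a list of (pose, weight) tuples, where pose is a list of
    (x, y, z) joint positions in a shared joint order and weight is a
    non-negative confidence. Returns the weighted average per joint.
    """
    total = sum(weight for _, weight in estimates)
    if total <= 0:
        raise ValueError("at least one estimate needs a positive weight")
    num_joints = len(estimates[0][0])
    merged = []
    for j in range(num_joints):
        merged.append(tuple(
            sum(pose[j][axis] * weight for pose, weight in estimates) / total
            for axis in range(3)
        ))
    return merged


# Two single-joint estimates; the second is trusted three times as much.
merged = merge_poses([
    ([(0.0, 0.0, 0.0)], 1.0),
    ([(2.0, 2.0, 2.0)], 3.0),
])
```

Weighting by confidence lets views with less occlusion contribute more to the merged pose than views where the subject is partly hidden.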
System 102 provides a reconstructed pose.
System 102 provides the reconstructed pose similarly to step 710 described above.
FIG. 8 is an example flow diagram for recognizing actions of a target subject, according to some implementations.
A method is initiated at block 802, where a system such as system 102 determines estimated 2D poses.
The system collects estimated 2D poses for each subject detected by the cameras.
The system determines the time [t] of each estimated 2D pose.
System 102 determines poses in 3D space.
The system collects estimated poses for each subject detected by the cameras.
The system utilizes a skeleton-based approach with reconstructed 3D poses to help improve the robustness of action recognition.
The system may determine the height of a target subject in absolute values (e.g., 5’8”, etc.), or the height relative to the height of other subjects (e.g., taller by 2”, shorter by 1”, etc.).
The system may determine the center of mass of the target subject. The center of mass may be useful in determining the position of the target subject relative to other subjects (e.g., people, objects, etc.).
The system may determine a movement trajectory of the target subject relative to another subject (e.g., walking past a particular other subject, etc.).
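Quantities such as height and center of mass can be approximated directly from a reconstructed 3D pose. The sketch below assumes the pose is a mapping from joint name to (x, y, z) in meters with the z axis pointing up; it approximates height by the vertical extent of the key points and the center of mass by their unweighted centroid. These simplifications, the joint names, and the function names are illustrative assumptions.

```python
def body_height(pose_3d):
    """Approximate height as the vertical extent of the key points."""
    zs = [p[2] for p in pose_3d.values()]
    return max(zs) - min(zs)


def center_of_mass(pose_3d):
    """Approximate the center of mass by the unweighted key-point centroid."""
    n = len(pose_3d)
    return tuple(sum(p[i] for p in pose_3d.values()) / n for i in range(3))


def height_difference(pose_a, pose_b):
    """Positive when subject A is taller than subject B."""
    return body_height(pose_a) - body_height(pose_b)


subject_a = {
    "head": (0.0, 0.0, 1.7),
    "hip": (0.0, 0.0, 0.9),
    "left_foot": (-0.1, 0.0, 0.0),
    "right_foot": (0.1, 0.0, 0.0),
}
subject_b = {
    "head": (2.0, 0.0, 1.6),
    "hip": (2.0, 0.0, 0.85),
    "left_foot": (1.9, 0.0, 0.0),
    "right_foot": (2.1, 0.0, 0.0),
}
height_a = body_height(subject_a)
com_a = center_of_mass(subject_a)
taller_by = height_difference(subject_a, subject_b)
```

Because these values are measured in 3D space rather than in the image plane, they do not depend on which camera happens to be closest to the subject.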
System 102 determines and converts back-projected pose information from 3D space to 2D space information.
The system may utilize 3D position information from previous frames in order to differentiate similar actions from one another. For example, the system determines the time [t] and back-projected times (t-n) of the back-projected pose information. The system may compare the pose of the target subject at different times based on previous frames to collect pose information about the target subject in 3D space and 2D space.
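A simple way to keep the poses from times t-n through t available is a fixed-length buffer. The sketch below stores the most recent frames and derives the average velocity of one joint over the window, which is the kind of temporal cue that can separate actions with similar single-frame poses (for example, sitting down versus standing up, which differ in the sign of the vertical hip velocity). The class name `PoseHistory` and its methods are hypothetical.

```python
from collections import deque


class PoseHistory:
    """Keeps the poses from time t-n up to the current time t."""

    def __init__(self, n):
        # n previous frames plus the current one
        self.frames = deque(maxlen=n + 1)

    def push(self, t, pose_3d):
        """Add the pose observed at time t; the oldest frame is dropped if full."""
        self.frames.append((t, pose_3d))

    def joint_velocity(self, joint):
        """Average velocity of one joint across the window, or None."""
        if len(self.frames) < 2:
            return None
        (t0, p0), (t1, p1) = self.frames[0], self.frames[-1]
        dt = t1 - t0
        if dt == 0:
            return None
        return tuple((b - a) / dt for a, b in zip(p0[joint], p1[joint]))


history = PoseHistory(n=2)
for t, hip_z in [(0, 0.5), (1, 0.6), (2, 0.9), (3, 1.0)]:
    history.push(t, {"hip": (0.0, 0.0, hip_z)})
hip_velocity = history.joint_velocity("hip")
window_times = [t for t, _ in history.frames]
```

After four frames with n=2, the window holds times 1 to 3, and the positive vertical hip velocity indicates an upward movement.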
System 102 recognizes one or more actions of the target subject.
The system may determine various actions, order of actions, and times of actions. For example, the system may determine if the target subject was seated, if the target subject was standing, and the order and times that the target subject was seated and standing.
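To make the notion of actions, their order, and their times concrete, the sketch below labels each frame as seated or standing from the hip height of the reconstructed 3D pose and records the time of every change. This is a deliberately simple rule-based stand-in for illustration only; it is not the recognition method of the source, and the 0.75 ratio and function names are assumptions.

```python
def label_posture(hip_height, standing_hip_height, ratio=0.75):
    """Label a frame as standing when the hip is near its standing height."""
    if hip_height >= ratio * standing_hip_height:
        return "standing"
    return "seated"


def posture_timeline(samples, standing_hip_height):
    """Return [(time, state)] entries, one for each change of posture.

    samples is a time-ordered list of (time, hip_height) pairs.
    """
    timeline = []
    for t, hip_height in samples:
        state = label_posture(hip_height, standing_hip_height)
        if not timeline or timeline[-1][1] != state:
            timeline.append((t, state))
    return timeline


samples = [(0, 0.50), (1, 0.52), (2, 0.80), (3, 0.95), (4, 0.50)]
timeline = posture_timeline(samples, standing_hip_height=1.0)
```

The resulting timeline lists the subject as seated from time 0, standing from time 2, and seated again from time 4, which captures both the order and the times of the postures.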
The system may use various machine learning or deep learning techniques to recognize the actions.
The system may use a convolutional neural network (CNN), a recurrent neural network (RNN), a graph convolutional network (GCN), or other suitable neural network(s) to recognize one or more actions of the target subject.
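To show how a skeleton can feed a graph convolutional network, the sketch below implements a single graph-convolution step, ReLU(Â X W), where Â is the symmetrically normalized adjacency matrix of the skeleton with self-loops. The 3-joint skeleton, the untrained weights, and the function names are hypothetical; a real recognizer would stack trained layers and add temporal modeling.

```python
from math import sqrt


def normalized_adjacency(num_joints, bones):
    """Return D^-1/2 (A + I) D^-1/2 for an undirected skeleton graph."""
    a = [[1.0 if i == j else 0.0 for j in range(num_joints)]
         for i in range(num_joints)]
    for i, j in bones:
        a[i][j] = a[j][i] = 1.0
    deg = [sum(row) for row in a]
    return [[a[i][j] / sqrt(deg[i] * deg[j]) for j in range(num_joints)]
            for i in range(num_joints)]


def matmul(x, y):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))]
            for i in range(len(x))]


def gcn_layer(a_hat, features, weights):
    """One graph-convolution step: ReLU(a_hat @ features @ weights)."""
    out = matmul(matmul(a_hat, features), weights)
    return [[max(0.0, v) for v in row] for row in out]


# Hypothetical 3-joint chain (hip - spine - head); features are 3D positions.
bones = [(0, 1), (1, 2)]
a_hat = normalized_adjacency(3, bones)
joints_3d = [[0.0, 0.0, 0.9], [0.0, 0.0, 1.3], [0.0, 0.0, 1.7]]
weights = [[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]]  # untrained, illustrative 3x2
hidden = gcn_layer(a_hat, joints_3d, weights)
```

Each output row mixes a joint's features with those of its neighbors along the skeleton, which is what lets a GCN reason about body structure rather than isolated key points.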
System 102 determines one or more action categories for the recognized action or actions of the target subject. For example, the system may categorize a given action as a transition action (e.g., seated to standing, etc.). In another example, the system may categorize a given action as a movement (e.g., walking, raising a hand, etc.). In another example, the system may categorize a given action as a handling of an object (e.g., picking up a computer, inserting a key in a door, etc.). The particular categories may vary, depending on the particular implementation.
The system uses the reconstructed poses in a 3D virtual space as prior knowledge, and the system recognizes the actions for all target subjects in the scene with a deep learning based approach. Implementations may be built upon any 2D and/or 3D pose estimation systems.
The system may detect if a target subject is committing a crime or other unacceptable behavior based on the actions and categories of actions.
The system may monitor a target subject as the target subject is playing a video game (e.g., tracking movements of the target subject in the context of the video game, etc.).
Implementations are robust to occlusion, which may occur frequently in practical applications where multiple observed subjects are involved. This may involve self-occlusion. For example, a given target subject may move to a position that is blocked by another subject from a given camera. This may also involve inter-object occlusion, where a given target subject is blocked by an object from a given camera. Using multiple cameras and tracking the target subject based on the 3D model avoids such occlusion issues. For example, in some implementations, the system may distinguish between positions of subjects based on respective key points, and track these positions. Likewise, the system identifies and tracks distinct actions of the various subjects, including one or more target subjects. The system may determine which portions of a given subject are occluded. By tracking relative key points, the system may ascertain the positions and actions using the 3D model based on multiple cameras and multiple respective fields of view. Implementations require minimal data for optimal performance, unlike conventional systems that require a certain amount of data for efficient analysis and training.
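One way to picture how multiple fields of view help with occlusion is to count, for each key point, the cameras that currently see it. The sketch below is an assumption-laden illustration: the confidence threshold, the camera and key point names, and the rule that at least two views are needed before a point is treated as recoverable by triangulation are choices made for this example, not details from the described implementations.

```python
from typing import Dict, List

# observations[camera][key_point] = detection confidence in [0, 1] (hypothetical format)
Observations = Dict[str, Dict[str, float]]


def visible_views(
    observations: Observations, key_point: str, min_confidence: float = 0.5
) -> List[str]:
    """Return the cameras in which the key point is confidently detected."""
    return sorted(
        camera
        for camera, key_points in observations.items()
        if key_points.get(key_point, 0.0) >= min_confidence
    )


def occlusion_report(
    observations: Observations, key_points: List[str], min_views: int = 2
) -> Dict[str, str]:
    """Label each key point by how many cameras can see it."""
    report = {}
    for key_point in key_points:
        count = len(visible_views(observations, key_point))
        if count >= min_views:
            report[key_point] = "triangulable"
        elif count == 1:
            report[key_point] = "single view"
        else:
            report[key_point] = "occluded in all views"
    return report


observations = {
    "cam_a": {"left_wrist": 0.9, "right_wrist": 0.1, "nose": 0.8},
    "cam_b": {"left_wrist": 0.7, "right_wrist": 0.2, "nose": 0.3},
    "cam_c": {"left_wrist": 0.2, "right_wrist": 0.0},
}
report = occlusion_report(observations, ["left_wrist", "right_wrist", "nose"])
print(report)
```

A key point hidden from one camera can still be reported as recoverable when two other cameras see it, which is the benefit the text attributes to multiple cameras.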
Implementations also apply to uncontrolled environments, where there may be a lack of distinguishable visual information due to motion blur and illumination variations.
The system may adapt to motion blur by accessing video from the multiple cameras, where some cameras might not experience motion blur.
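The text does not say how blurred views are identified. As one possible illustration, the sketch below scores each camera's frame with the variance of a discrete Laplacian, a common sharpness heuristic that is swapped in here as an assumption, and picks the sharpest view. The function names and the tiny test images are hypothetical.

```python
from statistics import pvariance
from typing import Dict, List

Image = List[List[float]]  # grayscale image as rows of pixel intensities


def laplacian_variance(img: Image) -> float:
    """Variance of the 4-neighbour Laplacian; lower values suggest more blur."""
    responses = [
        img[r - 1][c] + img[r + 1][c] + img[r][c - 1] + img[r][c + 1] - 4 * img[r][c]
        for r in range(1, len(img) - 1)
        for c in range(1, len(img[0]) - 1)
    ]
    return pvariance(responses)


def sharpest_camera(frames: Dict[str, Image]) -> str:
    """Return the camera whose current frame has the highest sharpness score."""
    return max(frames, key=lambda camera: laplacian_variance(frames[camera]))


# Hypothetical 4x4 frames: a high-contrast checkerboard and a flat gray image.
checkerboard = [[255.0 if (r + c) % 2 == 0 else 0.0 for c in range(4)] for r in range(4)]
flat_gray = [[128.0] * 4 for _ in range(4)]
best = sharpest_camera({"cam_a": checkerboard, "cam_b": flat_gray})
print(best)
```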
The system adapts to illumination variation and changes based on the 3D model. For example, the system may detect changes in illumination based on one or more cameras. The system may adjust or recalibrate one or more of the cameras automatically without human intervention.
These functions may apply to various real-world applications (e.g., healthcare, security, human-computer interaction, etc.).
Implementations described herein provide various benefits. For example, implementations described herein are simple yet effective in multi-camera multi-target pose reconstruction in 3D. Implementations described herein also provide a cost-effective solution for pose matching, which serves as an important step for further 3D pose reconstruction. Implementations described herein achieve higher performance than deep learning methods. Implementations described herein are also more flexible and scalable than existing deep learning techniques, without requiring significant additional data for training. The ability to recognize actions of multiple target subjects has the advantage of also tracking interactions or exchanges between two or more target subjects (e.g., interactions in a ball game, sale transactions in a retail store, etc.).
FIG. 9 is a block diagram of an example network environment 900, which may be used for some implementations described herein.
Network environment 900 includes a system 902, which includes a server device 904 and a database 906.
System 902 may be used to implement system 102 of FIG. 1, as well as to perform implementations described herein.
Network environment 900 also includes client devices 910, 920, 930, and 940, which may communicate with system 902 and/or may communicate with each other directly or via system 902.
Network environment 900 also includes a network 950 through which system 902 and client devices 910, 920, 930, and 940 communicate.
Network 950 may be any suitable communication network such as a Wi-Fi network, Bluetooth network, the Internet, etc.
FIG. 9 shows one block for each of system 902, server device 904, and network database 906, and shows four blocks for client devices 910, 920, 930, and 940.
Blocks 902, 904, and 906 may represent multiple systems, server devices, and network databases. Also, there may be any number of client devices.
Environment 900 may not have all of the components shown and/or may have other elements including other types of elements instead of, or in addition to, those shown herein.
While server device 904 of system 902 performs implementations described herein, in other implementations, any suitable component or combination of components associated with system 902 or any suitable processor or processors associated with system 902 may facilitate performing the implementations described herein.
A processor of system 902 and/or a processor of any of client devices 910, 920, 930, and 940 causes the elements described herein (e.g., information, etc.) to be displayed in a user interface on one or more display screens.
FIG. 10 is a block diagram of an example computer system 1000, which may be used for some implementations described herein.
Computer system 1000 may be used to implement server device 904 of FIG. 9 and/or system 102 of FIG. 1, as well as to perform implementations described herein.
Computer system 1000 may include a processor 1002, an operating system 1004, a memory 1006, and an input/output (I/O) interface 1008.
Processor 1002 may be used to implement various functions and features described herein, as well as to perform the method implementations described herein. While processor 1002 is described as performing implementations described herein, any suitable component or combination of components of computer system 1000 or any suitable processor or processors associated with computer system 1000 or any suitable system may perform the steps described. Implementations described herein may be carried out on a user device, on a server, or a combination of both.
Computer system 1000 also includes a software application 1010, which may be stored on memory 1006 or on any other suitable storage location or computer-readable medium.
Software application 1010 provides instructions that enable processor 1002 to perform the implementations described herein and other functions.
Software application 1010 may also include an engine such as a network engine for performing various functions associated with one or more networks and network communications.
The components of computer system 1000 may be implemented by one or more processors or any combination of hardware devices, as well as any combination of hardware, software, firmware, etc.
FIG. 10 shows one block for each of processor 1002, operating system 1004, memory 1006, I/O interface 1008, and software application 1010. These blocks 1002, 1004, 1006, 1008, and 1010 may represent multiple processors, operating systems, memories, I/O interfaces, and software applications.
Computer system 1000 may not have all of the components shown and/or may have other elements including other types of components instead of, or in addition to, those shown herein.
Any suitable programming language may be used to implement the routines of particular implementations, including C, C++, Java, assembly language, etc.
Different programming techniques can be employed, such as procedural or object-oriented.
The routines can execute on a single processing device or multiple processors. Although the steps, operations, or computations may be presented in a specific order, this order may be changed in different particular implementations. In some particular implementations, multiple steps shown as sequential in this specification can be performed at the same time.
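The point that a routine written as sequential steps may also run concurrently can be sketched as follows. This is an illustration only: `count_detections` is a hypothetical stand-in for any per-camera routine, and the thread pool is one of several ways to run work in parallel.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def count_detections(frame: List[int]) -> int:
    """Stand-in for a per-camera routine; counts non-zero entries."""
    return sum(1 for value in frame if value)


def run_sequential(frames: Dict[str, List[int]]) -> Dict[str, int]:
    """Process each camera's frame one after another."""
    return {camera: count_detections(frame) for camera, frame in frames.items()}


def run_parallel(frames: Dict[str, List[int]], workers: int = 4) -> Dict[str, int]:
    """Process the same frames concurrently; map() preserves input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(count_detections, frames.values())
        return dict(zip(frames.keys(), results))


frames = {"cam_a": [0, 1, 1], "cam_b": [1, 1, 1], "cam_c": [0, 0, 0]}
sequential = run_sequential(frames)
parallel = run_parallel(frames)
print(sequential, parallel)
```

Both functions return the same result, so the choice between them affects only how the work is scheduled.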
Particular implementations may be implemented in a non-transitory computer-readable storage medium (also referred to as a machine-readable storage medium) for use by or in connection with the instruction execution system, apparatus, or device.
Particular implementations may be implemented in the form of control logic in software or hardware or a combination of both.
The control logic, when executed by one or more processors, is operable to perform the implementations described herein and other functions.
A tangible medium such as a hardware storage device can be used to store the control logic, which can include executable instructions.
Particular implementations may be implemented by using a programmable general purpose digital computer, and/or by using application specific integrated circuits, programmable logic devices, field programmable gate arrays, optical, chemical, biological, quantum or nanoengineered systems, components and mechanisms.
The functions of particular implementations can be achieved by any means known in the art.
Distributed, networked systems, components, and/or circuits can be used. Communication, or transfer, of data may be wired, wireless, or by any other means.
A “processor” may include any suitable hardware and/or software system, mechanism, or component that processes data, signals or other information.
A processor may include a system with a general-purpose central processing unit, multiple processing units, dedicated circuitry for achieving functionality, or other systems. Processing need not be limited to a geographic location, or have temporal limitations. For example, a processor may perform its functions in “real-time,” “offline,” in a “batch mode,” etc. Portions of processing may be performed at different times and at different locations, by different (or the same) processing systems.
A computer may be any processor in communication with a memory.
The memory may be any suitable data storage, memory and/or non-transitory computer-readable storage medium, including electronic storage devices such as random-access memory (RAM), read-only memory (ROM), magnetic storage device (hard disk drive or the like), flash, optical storage device (CD, DVD or the like), magnetic or optical disk, or other tangible media suitable for storing instructions (e.g., program or software instructions) for execution by the processor.
The instructions can also be contained in, and provided as, an electronic signal, for example in the form of software as a service (SaaS) delivered from a server (e.g., a distributed system and/or a cloud computing system).

Landscapes

Engineering & Computer Science (AREA)
Physics & Mathematics (AREA)
General Physics & Mathematics (AREA)
Multimedia (AREA)
Theoretical Computer Science (AREA)
Health & Medical Sciences (AREA)
Computer Vision & Pattern Recognition (AREA)
General Health & Medical Sciences (AREA)
Psychiatry (AREA)
Social Psychology (AREA)
Human Computer Interaction (AREA)
Image Analysis (AREA)

EP22751825.5A 2021-08-10 2022-07-20 Mehrfachansichts-mehrzielaktionserkennung Pending EP4367648A1 (de)

Applications Claiming Priority (3)

Application Number	Priority Date	Filing Date	Title
US202163260108P	2021-08-10	2021-08-10
US17/559,751 US12299929B2 (en)	2021-08-10	2021-12-22	Multi-view multi-target action recognition
PCT/IB2022/056655 WO2023017339A1 (en)	2021-08-10	2022-07-20	Multi-view multi-target action recognition

Publications (1)

Publication Number	Publication Date
EP4367648A1 (de)	2024-05-15

Family

ID=82846496

Family Applications (1)

Application Number	Title	Priority Date	Filing Date
EP22751825.5A Pending EP4367648A1 (de)	2021-08-10	2022-07-20	Mehrfachansichts-mehrzielaktionserkennung

Country Status (4)

Country	Link
EP (1)	EP4367648A1 (de)
JP (1)	JP2024530490A (de)
KR (1)	KR20230141866A (de)
WO (1)	WO2023017339A1 (de)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number	Priority date	Publication date	Assignee	Title
JP2017080202A (ja) *	2015-10-29	2017-05-18	キヤノンマーケティングジャパン株式会社	情報処理装置、情報処理方法、プログラム
WO2019032307A1 (en) *	2017-08-07	2019-02-14	Standard Cognition, Corp.	PREDICTING INVENTORY EVENTS THROUGH PREVIOUS / BACKGROUND TREATMENT
US11328239B2 (en) *	2019-04-12	2022-05-10	University Of Iowa Research Foundation	System and method to predict, prevent, and mitigate workplace injuries
CN112766057B (zh) *	2020-12-30	2022-05-13	浙江大学	一种面向复杂场景细粒度属性驱动的步态数据集合成方法

2022
- 2022-07-20 EP EP22751825.5A patent/EP4367648A1/de active Pending
- 2022-07-20 JP JP2024507163A patent/JP2024530490A/ja active Pending
- 2022-07-20 WO PCT/IB2022/056655 patent/WO2023017339A1/en not_active Ceased
- 2022-07-20 KR KR1020237030517A patent/KR20230141866A/ko active Pending

Also Published As

Publication number	Publication date
KR20230141866A (ko)	2023-10-10
JP2024530490A (ja)	2024-08-21
WO2023017339A1 (en)	2023-02-16

Legal Events

Date	Code	Title	Description
2022-08-19	STAA	Information on the status of an ep patent application or granted ep patent	Free format text: STATUS: UNKNOWN
2023-02-17	STAA	Information on the status of an ep patent application or granted ep patent	Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE
2024-04-12	PUAI	Public reference made under article 153(3) epc to a published international application that has entered the european phase	Free format text: ORIGINAL CODE: 0009012
2024-04-12	STAA	Information on the status of an ep patent application or granted ep patent	Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE
2024-05-15	17P	Request for examination filed	Effective date: 20240207
2024-05-15	AK	Designated contracting states	Kind code of ref document: A1 Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR
2024-11-13	DAV	Request for validation of the european patent (deleted)
2024-11-13	DAX	Request for extension of the european patent (deleted)
