TW201911119A

TW201911119A - System and components usable for cashierless checkout

Info

Publication number: TW201911119A
Application number: TW107126341A
Authority: TW
Inventors: 喬丹費雪; 丹尼爾菲奇帝; 布蘭登歐格; 約翰諾瓦克; 凱爾多曼; 肯尼士木原; 喬安拉席拉; 大衛瓦德曼
Original assignee: 美商標準認知公司
Priority date: 2017-08-07
Filing date: 2018-07-30
Publication date: 2019-03-16
Also published as: WO2019032304A1; EP3665648A4; WO2019032307A1; WO2019032306A9; JP2020530170A; CA3072056A1; TWI773797B; WO2019032306A1; JP7181922B2; JP7191088B2; WO2019032305A2; EP3665615A1; EP3665648A2; CA3072058A1; EP3665647A4; JP2020530168A; EP3665647A1; JP7228569B2; JP2020530167A; JP7208974B2

Abstract

Systems and techniques are provided for tracking puts and takes of inventory items by subjects in an area of real space. A plurality of cameras with overlapping fields of view produce respective sequences of images of corresponding fields of view in the real space. In one embodiment, the system includes first image processors, including subject image recognition engines, that receive corresponding sequences of images from the plurality of cameras. The first image processors process the images to identify subjects represented in the images in the corresponding sequences of images. The system includes second image processors, including background image recognition engines, that receive corresponding sequences of images from the plurality of cameras. The second image processors mask the identified subjects to generate masked images. The second image processors then process the masked images to identify and classify background changes represented in the images in the corresponding sequences of images.

Description

System and components that can be used for cashierless checkout

The present invention relates to systems, and components thereof, that can be used for cashierless checkout.

A difficult problem in image processing arises when images from multiple cameras disposed over a large space are used to identify and track the actions of subjects.

Tracking the actions of subjects within an area of real space, such as people in a shopping store, presents many technical challenges. For example, consider such an image processing system deployed in a shopping store, where many customers move through the aisles between shelves and through open spaces within the store. Customers take items from shelves and place them in their respective shopping carts or baskets. Customers may also put items back on the shelf if they do not want them.

While customers are performing these actions, different portions of the customers, and different portions of the shelves or other display structures holding the store's inventory, will be occluded in images from different cameras because of the presence of other customers, shelves, product displays, and so on. Also, there may be many customers in the store at any given time, making it difficult to identify and track individuals and their actions over time.

It is desirable to provide a system that can more efficiently and automatically identify and track put and take actions of subjects in large spaces, and that can perform other processes supporting complex interactions of subjects with their environment, including functions such as cashierless checkout.

A system, and a method for operating the system, are provided for tracking changes made by subjects, such as people, in an area of real space, and other complex interactions of subjects with their environment, using image processing. This function of tracking changes by image processing presents a complex problem of computer engineering, relating to the type of image data to be processed, what processing of the image data to perform, and how to determine actions from the image data with high reliability. The system described herein can operate using only images from cameras disposed overhead in the real space, so that no retrofitting of store shelves and floor space with sensors and the like is required for deployment in a given setting.

A system and method are provided for tracking puts and takes of inventory items by subjects in an area of real space that includes inventory display structures. The system and method comprise using a plurality of cameras disposed above the inventory display structures to produce respective sequences of images of the inventory display structures in corresponding fields of view in the real space, the field of view of each camera overlapping with the field of view of at least one other camera in the plurality of cameras. Using these sequences of images, a system and method are described for detecting puts and takes of inventory items by identifying semantically significant changes in the sequences of images relating to inventory items on the inventory display structures, and associating the semantically significant changes with subjects represented in the sequences of images.

A system and method are provided for tracking puts and takes of inventory items by subjects in an area of real space. The system and method comprise using a plurality of cameras disposed above inventory display structures to produce respective sequences of images of the inventory display structures in corresponding fields of view in the real space, the field of view of each camera overlapping with the field of view of at least one other camera in the plurality of cameras. Using these sequences of images, a system and method are described for detecting puts and takes of inventory items by processing foreground data in the sequences of images to identify gestures of subjects and the inventory items associated with the gestures.

Also, a system and method are described that combine foreground processing and background processing in the same sequences of images. In this combined approach, the system and method include using these sequences of images to detect puts and takes of inventory items by processing foreground data in the sequences of images to identify gestures of subjects and the inventory items associated with the gestures; and using these sequences of images to detect puts and takes of inventory items by identifying semantically significant changes in the sequences of images relating to inventory items on the inventory display structures, and associating the semantically significant changes with subjects represented in the sequences of images.

In an embodiment described herein, the system uses a plurality of cameras to produce respective sequences of images of corresponding fields of view in the real space. The field of view of each camera overlaps with the field of view of at least one other camera in the plurality of cameras. The system includes first image processors, including subject image recognition engines, that receive corresponding sequences of images from the plurality of cameras. The first image processors process the images to identify subjects represented in the images in the corresponding sequences of images. The system further includes second image processors, including background image recognition engines, that receive corresponding sequences of images from the plurality of cameras. The second image processors mask the identified subjects to generate masked images, and process the masked images to identify and classify background changes represented in the images in the corresponding sequences of images.

In one embodiment, the background image recognition engines comprise convolutional neural networks. The system includes logic to associate identified background changes with identified subjects.

In one embodiment, the second image processors include a background image store to store background images for the corresponding sequences of images. The second image processors further include mask logic that processes images in the sequences of images to replace foreground image data representing the identified subjects with background image data. The background image data is taken from the background images for the corresponding sequences of images to provide the masked images.
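The masking step above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes grayscale images stored as nested lists and a per-pixel boolean subject mask, and the names `mask_subjects`, `frame`, `background`, and `subject_mask` are hypothetical.

```python
from typing import List

Image = List[List[int]]  # assumption: grayscale image as rows of pixel values


def mask_subjects(frame: Image, background: Image,
                  subject_mask: List[List[bool]]) -> Image:
    """Replace every pixel flagged as part of a subject with the stored
    background pixel at the same position; keep all other pixels."""
    return [
        [bg if is_subject else px
         for px, bg, is_subject in zip(frame_row, bg_row, mask_row)]
        for frame_row, bg_row, mask_row in zip(frame, background, subject_mask)
    ]


# Illustrative data: a subject occupies the middle column of the frame.
frame = [[10, 200, 10], [10, 210, 10]]
background = [[10, 12, 10], [10, 11, 10]]
subject_mask = [[False, True, False], [False, True, False]]

masked = mask_subjects(frame, background, subject_mask)
print(masked)
```

In this sketch the masked image is identical to the stored background wherever a subject was present, so a later comparison against the background only responds to changes on the shelves themselves.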

In one embodiment, the mask logic combines sets of N masked images in the sequences of images to generate a sequence of factored images for each camera. The second image processors identify and classify background changes by processing the sequences of factored images.
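One way to picture the combination of N masked images is a pixel-wise average over consecutive groups. The text above only says the images are combined, so the averaging rule, the non-overlapping grouping, and the name `factored_images` are assumptions made for illustration.

```python
from typing import List

Image = List[List[float]]  # assumption: grayscale image as rows of floats


def factored_images(masked: List[Image], n: int) -> List[Image]:
    """Average each consecutive, non-overlapping group of n masked images
    pixel by pixel. Trailing images that do not fill a group are dropped."""
    out: List[Image] = []
    for start in range(0, len(masked) - n + 1, n):
        group = masked[start:start + n]
        rows, cols = len(group[0]), len(group[0][0])
        out.append([
            [sum(img[r][c] for img in group) / n for c in range(cols)]
            for r in range(rows)
        ])
    return out


# Four 1x2 masked images from one camera, combined in groups of N = 2.
frames = [[[0.0, 2.0]], [[2.0, 4.0]], [[4.0, 6.0]], [[6.0, 8.0]]]
result = factored_images(frames, n=2)
print(result)
```

Combining several frames this way reduces the number of images the background recognition engine has to process and smooths out single-frame noise, at the cost of some temporal resolution.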

In one embodiment, the second image processors include logic to produce change data structures for the corresponding sequences of images. The change data structures include coordinates in the masked images of identified background changes, identifiers of the inventory item that is the subject of each identified background change, and classifications of the identified background changes. The second image processors further include coordination logic to process change data structures from sets of cameras having overlapping fields of view to locate the identified background changes in real space.

In one embodiment, the classifications of identified background changes in the change data structures indicate whether the identified inventory item has been added or removed relative to the background image.
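A change data structure with the three kinds of fields named above (coordinates, item identifier, classification) might look like the sketch below. The field names, the bounding-box layout, and the simple agreement rule used for coordination are assumptions; the text does not specify how coordination across cameras is performed.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class BackgroundChange:
    camera_id: str                   # hypothetical: camera that saw the change
    box: Tuple[int, int, int, int]   # x_min, y_min, x_max, y_max in the masked image
    item_id: str                     # identifier of the inventory item
    change_class: str                # "added" or "removed" relative to the background


def corroborated(changes: List[BackgroundChange],
                 min_cameras: int = 2) -> List[Tuple[str, str]]:
    """Assumed coordination rule: keep (item, class) pairs reported by at
    least `min_cameras` distinct cameras with overlapping fields of view."""
    cameras = defaultdict(set)
    for ch in changes:
        cameras[(ch.item_id, ch.change_class)].add(ch.camera_id)
    return sorted(key for key, seen in cameras.items() if len(seen) >= min_cameras)


changes = [
    BackgroundChange("cam_a", (40, 10, 60, 30), "sku-1", "removed"),
    BackgroundChange("cam_b", (120, 15, 140, 35), "sku-1", "removed"),
    BackgroundChange("cam_a", (200, 10, 220, 30), "sku-2", "added"),
]
confirmed = corroborated(changes)
print(confirmed)
```

A fuller version would also use camera calibration to map each `box` into real-space coordinates before comparing reports from different cameras.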

In another embodiment, the classifications of identified background changes in the change data structures indicate whether the identified inventory item has been added or removed relative to the background image. The system further includes logic to associate background changes with identified subjects. Finally, the system includes logic to make detections of takes of inventory items by the identified subjects and of puts of inventory items on inventory display structures by the identified subjects.

In another embodiment, the system includes logic to associate background changes with identified subjects. The system further includes logic to make detections of takes of inventory items by the identified subjects and of puts of inventory items on inventory display structures by the identified subjects.

The system may include third image processors as described herein, including foreground image recognition engines, that receive corresponding sequences of images from the plurality of cameras. The third image processors process the images to identify and classify foreground changes represented in the images in the corresponding sequences of images.

A system, and a method for operating the system, are provided for tracking multi-joint subjects, such as people, in real space. The system uses a plurality of cameras to produce respective sequences of images of corresponding fields of view in the real space. The field of view of each camera overlaps with the field of view of at least one other camera in the plurality of cameras. The system processes images in the sequences of images to generate arrays of joint data structures corresponding to each image. The array of joint data structures corresponding to a particular image classifies elements of the particular image by joint type, time of the particular image, and coordinates of the element in the particular image. The system then translates the coordinates of the elements in the arrays of joint data structures corresponding to images in different sequences into candidate joints having coordinates in the real space. Finally, the system identifies constellations of candidate joints, where the constellations include respective sets of candidate joints having coordinates in real space, as multi-joint subjects in the real space.

In one embodiment, the image recognition engines comprise convolutional neural networks. Processing of an image by an image recognition engine includes generating confidence arrays for elements of the image. The confidence array for a particular element of an image includes confidence values for a plurality of joint types for the particular element. The confidence arrays are used to select a joint type for the joint data structure of the particular element.
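Selecting a joint type from a confidence array can be sketched as picking the entry with the highest confidence. The joint-type list, the field names, and the use of a plain maximum are assumptions for illustration; the text only states that the confidence arrays are used for the selection.

```python
from dataclasses import dataclass
from typing import Sequence

# Hypothetical, abbreviated list of joint types; the real set is not given here.
JOINT_TYPES = ["left_wrist", "right_wrist", "neck", "not_a_joint"]


@dataclass
class JointDataStructure:
    joint_type: str     # selected joint type for the image element
    confidence: float   # confidence value of the selected type
    x: int              # element coordinates in the image
    y: int
    timestamp: float    # time of the image


def classify_element(confidences: Sequence[float], x: int, y: int,
                     timestamp: float) -> JointDataStructure:
    """Pick the joint type with the highest confidence for one image element."""
    best = max(range(len(confidences)), key=confidences.__getitem__)
    return JointDataStructure(JOINT_TYPES[best], confidences[best], x, y, timestamp)


joint = classify_element([0.05, 0.80, 0.10, 0.05], x=412, y=96, timestamp=1.5)
print(joint)
```

Running this over every element of an image would yield the array of joint data structures for that image.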

In one embodiment of the system for tracking multi-joint subjects, identifying sets of candidate joints comprises applying heuristic functions, based on physical relationships among the joints of subjects in real space, to identify sets of candidate joints as multi-joint subjects. The processing includes storing the sets of joints identified as multi-joint subjects. Identifying sets of candidate joints includes determining whether a candidate joint identified in images taken at a particular time corresponds with a member of one of the sets of candidate joints identified as multi-joint subjects in preceding images.

In one embodiment, the sequences of images are synchronized so that images in each of the sequences of images captured by the plurality of cameras represent the real space at a single point in time on the time scale of the movement of subjects through the space.

The coordinates in real space of the members of a set of candidate joints identified as a multi-joint subject identify the location of the multi-joint subject in the area. In some embodiments, the processing includes simultaneous tracking of the locations of a plurality of multi-joint subjects in the area of real space. In some embodiments, the processing includes determining when a multi-joint subject in the plurality of multi-joint subjects leaves the area of real space. In some embodiments, the processing includes determining the direction in which the multi-joint subject is facing at a given point in time.

In an embodiment described herein, the system uses a plurality of cameras to produce respective sequences of images of corresponding fields of view in the real space. The field of view of each camera overlaps with the field of view of at least one other camera in the plurality of cameras. The system processes images in the sequences of images received from the plurality of cameras to identify subjects represented in the images and to generate classifications of the identified subjects. Finally, the system processes the classifications of identified subjects for sets of images in the sequences of images to detect takes of inventory items by identified subjects and puts of inventory items on shelves by identified subjects.

In one embodiment, the classification identifies whether the identified subject is holding an inventory item. The classification also identifies whether a hand of the identified subject is near a shelf, or whether a hand of the identified subject is near the identified subject. The classification of whether the hand is near the identified subject can include whether the hand is near a basket associated with the identified subject, and whether it is near the body of the identified subject.

With the technology described, images representing a hand of a subject in the field of view can be processed to generate classifications of the hand of the subject in a plurality of images in time sequence. The classifications of the hand from a sequence of images can be processed, using a convolutional neural network in some embodiments, to identify actions by the subject. The actions can be puts and takes of inventory items, as set out in embodiments described herein, or other types of actions that can be deciphered by processing images of hands.

With the technology described, images are processed to identify subjects in the field of view and to locate joints of the subjects. The located joints can be processed as described herein to identify bounding boxes in the corresponding images that include the hands of the subjects. The data within the bounding boxes can be processed into classifications of the hand of the subject in the corresponding image. The classifications of the hand of an identified subject, generated in this way from a sequence of images, can be processed to identify actions by the subject.

In a system including plural image recognition engines, such as both foreground and background image recognition engines, the system can make a first set of detections of takes of inventory items by identified subjects and of puts of inventory items on inventory display structures by identified subjects, and a second set of detections of takes of inventory items by identified subjects and of puts of inventory items on inventory display structures by identified subjects. Selection logic that processes the first and second sets of detections can be used to generate log data structures. The log data structures include lists of inventory items for identified subjects.
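The selection logic and log data structure can be sketched as below. The text does not say how a detection is chosen between the two sets, so this sketch assumes the sets are already aligned event by event and keeps the more confident detection of each pair; `Detection`, `select`, and `build_log` are hypothetical names.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Detection:
    subject_id: int
    item_id: str
    action: str        # "take" or "put"
    confidence: float  # assumed score attached to each detection


def select(first: List[Detection], second: List[Detection]) -> List[Detection]:
    """Assumed rule: for each aligned pair of detections of the same event,
    keep the one with the higher confidence (ties go to the first set)."""
    return [a if a.confidence >= b.confidence else b for a, b in zip(first, second)]


def build_log(detections: List[Detection]) -> Dict[int, Dict[str, int]]:
    """Per-subject list of items: a take adds one unit, a put removes one."""
    log: Dict[int, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for d in detections:
        log[d.subject_id][d.item_id] += 1 if d.action == "take" else -1
    return {s: {i: q for i, q in items.items() if q > 0} for s, items in log.items()}


first = [Detection(1, "milk", "take", 0.9), Detection(1, "soap", "take", 0.4)]
second = [Detection(1, "milk", "take", 0.7), Detection(1, "soup", "take", 0.8)]
log = build_log(select(first, second))
print(log)
```

Here the two engines disagree on the second event, and the more confident detection decides which item is entered in the subject's log.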

In embodiments described herein, the sequences of images from cameras in the plurality of cameras are synchronized. In a preferred implementation, the same cameras and the same sequences of images are used by both the foreground and background image processors. As a result, redundant detections of puts and takes of inventory items are made using the same input data, allowing for high confidence, and high accuracy, in the resulting data.

In one technology described herein, the system comprises logic to detect puts and takes of inventory items by identifying gestures of subjects, and the inventory items associated with the gestures, represented in the sequences of images. This can be done using foreground image recognition engines in coordination with subject image recognition engines as described herein.

In another technology described herein, the system comprises logic to detect puts and takes of inventory items by identifying semantically significant changes in inventory items on inventory display structures, such as shelves, over time, and associating the semantically significant changes with subjects represented in the sequences of images. This can be done using background image recognition engines in coordination with subject image recognition engines as described herein.

In systems applying the technology described herein, gesture analysis and semantic difference analysis can be combined and executed on the same sequences of synchronized images from an array of cameras.

Methods and computer program products that can be executed by computer systems are also described herein.

Other aspects and advantages of the present invention can be seen on review of the drawings, the detailed description, and the claims, which follow.

The following description is presented to enable any person skilled in the art to make and use the invention, and is provided in the context of a particular application and its requirements. Various modifications to the disclosed embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other embodiments and applications without departing from the spirit and scope of the present invention. Thus, the present invention is not intended to be limited to the embodiments shown but is to be accorded the widest scope consistent with the principles and features disclosed herein.

System Overview

A system and various implementations of the subject technology are described with reference to FIGS. 1-28A/28B. The system and processes are described with reference to FIG. 1, an architectural-level schematic of a system in accordance with an implementation. Because FIG. 1 is an architectural diagram, certain details are omitted to improve the clarity of the description.

The discussion of FIG. 1 is organized as follows. First, the elements of the system are described, followed by their interconnections. Then, the use of the elements in the system is described in greater detail.

FIG. 1 provides a block-diagram-level illustration of a system 100. The system 100 includes cameras 114, network nodes hosting image recognition engines 112a, 112b, and 112n, a tracking engine 110 deployed in a network node (or nodes) on the network, a calibrator 120, a subject database 140, a training database 150, a heuristics database 160 (for joints heuristics, for put and take heuristics, and for other heuristics that coordinate and combine the outputs of the multiple image recognition engines as described below), a calibration database 170, and a communication network or networks 181. The network nodes can host only one image recognition engine, or several image recognition engines, as described herein. The system can also include an inventory database and other supporting data.

As used herein, a network node is an addressable hardware device or virtual device that is attached to a network and is capable of sending, receiving, or forwarding information over a communications channel to or from other network nodes. Examples of electronic devices that can be deployed as hardware network nodes include all varieties of computers, workstations, laptop computers, handheld computers, and smartphones. Network nodes can be implemented in a cloud-based server system. More than one virtual device configured as a network node can be implemented using a single physical device.

For the sake of simplicity, only three network nodes hosting image recognition engines are shown in the system 100. However, any number of network nodes hosting image recognition engines can be connected to the tracking engine 110 through the network(s) 181. Also, the image recognition engines, the tracking engine, and the other processing engines described herein can execute using more than one network node in a distributed architecture.

The interconnection of the elements of system 100 will now be described. The network(s) 181 couples the network nodes 101a, 101b, and 101c, respectively hosting image recognition engines 112a, 112b, and 112n, the network node 102 hosting the tracking engine 110, the calibrator 120, the subject database 140, the training database 150, the joints heuristics database 160, and the calibration database 170. The cameras 114 are connected to the tracking engine 110 through the network nodes hosting image recognition engines 112a, 112b, and 112n. In one embodiment, the cameras 114 are installed in a shopping store (such as a supermarket) so that sets of cameras 114 (two or more) with overlapping fields of view are positioned over each aisle to capture images of real space in the store. In FIG. 1, two cameras are arranged over aisle 116a, two cameras are arranged over aisle 116b, and three cameras are arranged over aisle 116n. The cameras 114 are installed over the aisles with overlapping fields of view. In such an embodiment, the cameras are configured with the goal that customers moving in the aisles of the shopping store are present in the field of view of two or more cameras at any moment in time.

The cameras 114 can be synchronized in time with each other, so that images are captured at the same time, or close in time, and at the same image capture rate. The cameras 114 can send respective continuous streams of images at a predetermined rate to the network nodes hosting image recognition engines 112a-112n. Images captured at the same time, or close in time, by all the cameras covering an area of real space are synchronized in the sense that the synchronized images can be identified in the processing engines as representing different views of subjects having fixed positions in the real space. For example, in one embodiment, the cameras send image frames at a rate of 30 frames per second (fps) to the respective network nodes hosting image recognition engines 112a-112n. Each frame has a timestamp, an identity of the camera (abbreviated as "camera_id"), and a frame identity (abbreviated as "frame_id"), along with the image data.
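The frame metadata described above (timestamp, camera_id, frame_id, image data) and the grouping of frames captured close in time can be sketched as follows. The bucketing rule, which assigns each frame to the nearest 1/30 s slot, and the names `Frame` and `group_synchronized` are assumptions for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List

FPS = 30  # capture rate from the example embodiment


@dataclass(frozen=True)
class Frame:
    camera_id: str
    frame_id: int
    timestamp: float  # seconds; assumes all cameras share a common clock
    image: bytes      # raw image data (placeholder in this sketch)


def group_synchronized(frames: List[Frame]) -> Dict[int, List[Frame]]:
    """Bucket frames by the nearest 1/FPS capture slot so that frames taken
    at (nearly) the same time by different cameras end up together."""
    groups: Dict[int, List[Frame]] = defaultdict(list)
    for f in frames:
        groups[round(f.timestamp * FPS)].append(f)
    return dict(groups)


frames = [
    Frame("cam_a", 0, 0.0000, b""),
    Frame("cam_b", 0, 0.0040, b""),
    Frame("cam_a", 1, 0.0333, b""),
    Frame("cam_b", 1, 0.0370, b""),
]
groups = group_synchronized(frames)
for slot, members in sorted(groups.items()):
    print(slot, [(f.camera_id, f.frame_id) for f in members])
```

Frames whose clocks drift by close to half a slot could land in different buckets under this rule, so a real deployment would need tighter clock synchronization or a tolerance-based match.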

Cameras installed over an aisle are connected to respective image recognition engines. For example, in FIG. 1, the two cameras installed over the aisle 116a are connected to the network node 101a hosting the image recognition engine 112a. Likewise, the two cameras installed over the aisle 116b are connected to the network node 101b hosting the image recognition engine 112b. Each image recognition engine 112a-112n hosted in a network node or nodes 101a-101n separately processes the image frames received from one camera each in the illustrated example.

In one embodiment, each image recognition engine 112a, 112b, and 112n is implemented as a deep learning algorithm such as a convolutional neural network (abbreviated CNN). In such an embodiment, the CNN is trained using the training database 150. In an embodiment described herein, image recognition of subjects in the real space is based on identifying and grouping joints recognizable in the images, where the groups of joints can be attributed to an individual subject. For this joints-based analysis, the training database 150 has a large collection of images for each of the different types of joints of subjects. In the example embodiment of a shopping store, the subjects are the customers moving in the aisles between the shelves. In an example embodiment, during training of the CNN, the system 100 is referred to as a "training system". After the CNN has been trained using the training database 150, it is switched to production mode to process images of customers in the shopping store in real time. In an example embodiment, during production, the system 100 is referred to as a runtime system (also referred to as an inference system). The CNN in each image recognition engine produces arrays of joint data structures for images in its respective stream of images. In an embodiment as described herein, an array of joint data structures is produced for each processed image, so that each image recognition engine 112a-112n produces an output stream of arrays of joint data structures. These arrays of joint data structures from cameras having overlapping fields of view are further processed to form groups of joints, and to identify such groups of joints as subjects.

The cameras 114 are calibrated before the CNN is switched to production mode. The calibrator 120 calibrates the cameras and stores the calibration data in the calibration database 170.

The tracking engine 110, hosted on the network node 102, receives continuous streams of arrays of joint data structures for the subjects from the image recognition engines 112a-112n. The tracking engine 110 processes the arrays of joint data structures and translates the coordinates of the elements in the arrays of joint data structures corresponding to images in different sequences into candidate joints having coordinates in the real space. For each set of synchronized images, the combination of candidate joints identified throughout the real space can be considered, for the purposes of analogy, to be like a galaxy of candidate joints. For each succeeding point in time, movement of the candidate joints is recorded so that the galaxy changes over time. The output of the tracking engine 110 is stored in the subject database 140.

The tracking engine 110 uses logic to identify groups or sets of candidate joints having coordinates in real space as subjects in the real space. For the purposes of analogy, each set of candidate points is like a constellation of candidate joints at each point in time. The constellations of candidate joints can move over time.

The logic to identify sets of candidate joints comprises heuristic functions based on physical relationships among subjects in real space. These heuristic functions are used to identify sets of candidate joints as subjects. The heuristic functions are stored in the heuristics database 160. The output of the tracking engine 110 is stored in the subject database 140. Thus, the sets of candidate joints comprise individual candidate joints that have relationships according to the heuristic parameters with other individual candidate joints, and subsets of candidate joints in a given set that has been identified, or can be identified, as an individual subject.

The actual communication path through the network 181 can be point-to-point over public and/or private networks. The communications can occur over a variety of networks 182, e.g., private networks, VPN, MPLS circuit, or the Internet, and can use appropriate application programming interfaces (APIs) and data interchange formats, e.g., Representational State Transfer (REST), JavaScript™ Object Notation (JSON), Extensible Markup Language (XML), Simple Object Access Protocol (SOAP), Java™ Message Service (JMS), and/or Java Platform Module System. All of the communications can be encrypted. The communication is generally over a network such as a LAN (local area network), WAN (wide area network), telephone network (Public Switched Telephone Network (PSTN)), Session Initiation Protocol (SIP), wireless network, point-to-point network, star network, token ring network, hub network, or the Internet, inclusive of the mobile Internet, via protocols such as EDGE, 3G, 4G LTE, Wi-Fi, and WiMAX. Additionally, a variety of authorization and authentication techniques, such as username/password, Open Authorization (OAuth), Kerberos, SecureID, digital certificates and more, can be used to secure the communications.

The technology disclosed herein can be implemented in the context of any computer-implemented system including a database system, a multi-tenant environment, or a relational database implementation like an Oracle™ compatible database implementation, an IBM DB2 Enterprise Server™ compatible relational database implementation, a MySQL™ or PostgreSQL™ compatible relational database implementation or a Microsoft SQL Server™ compatible relational database implementation; or a NoSQL™ non-relational database implementation such as a Vampire™ compatible non-relational database implementation, an Apache Cassandra™ compatible non-relational database implementation, a BigTable™ compatible non-relational database implementation or an HBase™ or DynamoDB™ compatible non-relational database implementation. In addition, the technology disclosed can be implemented using different programming models like MapReduce™, bulk synchronous programming, MPI primitives, etc., or different scalable batch and stream management systems like Apache Storm™, Apache Spark™, Apache Kafka™, Apache Flink™, Truviso™, Amazon Elasticsearch Service™, Amazon Web Services™ (AWS), IBM Info-Sphere™, Borealis™, and Yahoo! S4™.

Camera Arrangement

The cameras 114 are arranged to track multi-joint entities (or subjects) in a three-dimensional (abbreviated 3D) real space. In the example embodiment of the shopping store, the real space can include the area of the shopping store where items for sale are stacked in shelves. A point in the real space can be represented by an (x, y, z) coordinate system. Each point in the area of real space for which the system is deployed is covered by the fields of view of two or more cameras 114.

In a shopping store, the shelves and other inventory display structures can be arranged in a variety of manners, such as along the walls of the shopping store, or in rows forming aisles, or a combination of the two arrangements. FIG. 2 shows an arrangement of shelves, forming the aisle 116a, viewed from one end of the aisle 116a. Two cameras, camera A 206 and camera B 208, are positioned over the aisle 116a at a predetermined distance from the roof 230 and the floor 220 of the shopping store, above the inventory display structures such as shelves. The cameras 114 comprise cameras disposed overhead and having fields of view encompassing respective parts of the inventory display structures and floor area in the real space. The coordinates in real space of the members of a set of candidate joints, identified as a subject, identify locations in the floor area of the subject. In the example embodiment of the shopping store, the real space can include all of the floor 220 in the shopping store from which inventory can be accessed. The cameras 114 are placed and oriented such that areas of the floor 220 and shelves can be seen by at least two cameras. The cameras 114 also cover at least part of the shelves 202 and 204 and the floor space in front of the shelves 202 and 204. Camera angles are selected to have both steep perspectives, straight down, and angled perspectives that give more complete body images of the customers. In one example embodiment, the cameras 114 are configured at a height of eight (8) feet or higher throughout the shopping store. FIG. 13 presents an illustration of such an embodiment.

In FIG. 2, the cameras 206 and 208 have overlapping fields of view, covering the space between a shelf A 202 and a shelf B 204 with overlapping fields of view 216 and 218, respectively. A location in the real space is represented as an (x, y, z) point of the real space coordinate system. "x" and "y" represent positions on a two-dimensional (2D) plane, which can be the floor 220 of the shopping store. The value "z" is the height of the point above the 2D plane at the floor 220 in one configuration.

FIG. 3 illustrates the aisle 116a viewed from the top of FIG. 2, further showing an example arrangement of the positions of the cameras 206 and 208 over the aisle 116a. The cameras 206 and 208 are positioned closer to opposite ends of the aisle 116a. The camera A 206 is positioned at a predetermined distance from the shelf A 202 and the camera B 208 is positioned at a predetermined distance from the shelf B 204. In another embodiment, in which more than two cameras are positioned over an aisle, the cameras are positioned at equal distances from each other. In such an embodiment, two cameras are positioned close to the opposite ends and a third camera is positioned in the middle of the aisle. It is understood that a number of different camera arrangements are possible.

Camera Calibration

The camera calibrator 120 performs two types of calibration: internal and external. In internal calibration, the internal parameters of the cameras 114 are calibrated. Examples of internal camera parameters include focal length, principal point, skew, fisheye coefficients, and so on. A variety of techniques for internal camera calibration can be used. One such technique is presented by Zhang in "A flexible new technique for camera calibration," published in IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 22, No. 11, November 2000.

In external calibration, the external camera parameters are calibrated in order to generate mapping parameters for translating the 2D image data into 3D coordinates in real space. In one embodiment, one subject, such as a person, is introduced into the real space. The subject moves through the real space on a path that passes through the field of view of each of the cameras 114. At any given point in the real space, the subject is present in the fields of view of at least two cameras forming a 3D scene. The two cameras, however, have different views of the same 3D scene in their respective two-dimensional (2D) image planes. A feature in the 3D scene, such as the left wrist of the subject, is viewed by the two cameras at different positions in their respective 2D image planes.

A point correspondence is established between every pair of cameras with overlapping fields of view for a given scene. Since each camera has a different view of the same 3D scene, a point correspondence is two pixel locations, one location from each camera with an overlapping field of view, that represent the projection of the same point in the 3D scene. Many point correspondences are identified for each 3D scene using the results of the image recognition engines 112a-112n for the purposes of the external calibration. The image recognition engines identify the position of a joint as (x, y) coordinates, such as row and column numbers, of pixels in the 2D image planes of the respective cameras 114. In one embodiment, a joint is one of 19 different types of joints of the subject. As the subject moves through the fields of view of different cameras, the tracking engine 110 receives, per image from the cameras 114, the (x, y) coordinates of each of the 19 different types of joints of the subject used for the calibration.

For example, consider an image from a camera A and an image from a camera B, both taken at the same moment in time and with overlapping fields of view. There are pixels in the image from camera A that correspond to pixels in the synchronized image from camera B. Consider that there is a specific point of some object or surface in view of both camera A and camera B, and that the point is captured in a pixel of both image frames. In external camera calibration, a multitude of such points are identified and referred to as corresponding points. Since there is one subject in the fields of view of camera A and camera B during calibration, key joints of this subject are identified, for example, the center of the left wrist. If these key joints are visible in the image frames from both camera A and camera B, then it is assumed that they represent corresponding points. This process is repeated for many image frames to build up a large collection of corresponding points for all pairs of cameras with overlapping fields of view. In one embodiment, images are streamed off of all cameras at a rate of 30 FPS (frames per second) or more and a resolution of 720 pixels in full RGB (red, green, and blue) color. These images are in the form of one-dimensional arrays (also referred to as flat arrays).
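By way of illustration only, the flat-array form mentioned above can be sketched as follows. The 1280x720 frame size with three channels matches the CNN input described later in this description; the row-major, channel-interleaved layout, the one-byte-per-channel storage, and the name flat_index are assumptions made for the example and are not stated in this description.

```python
# Hypothetical layout: row-major, with the R, G, B values of a pixel interleaved.
WIDTH, HEIGHT, CHANNELS = 1280, 720, 3


def flat_index(row: int, col: int, channel: int) -> int:
    """Return the offset of one channel value inside the flat frame buffer."""
    if not (0 <= row < HEIGHT and 0 <= col < WIDTH and 0 <= channel < CHANNELS):
        raise IndexError("pixel or channel out of range")
    return (row * WIDTH + col) * CHANNELS + channel


# One all-black frame, assuming one byte per channel value.
frame = bytearray(WIDTH * HEIGHT * CHANNELS)

# Set the red value of the pixel at row 10, column 20.
frame[flat_index(10, 20, 0)] = 255
```

Under these assumptions a joint reported at a given row and column can be looked up in the buffer without reshaping it into a two-dimensional matrix.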

The large number of images collected above for a subject can be used to determine corresponding points between cameras with overlapping fields of view. Consider two cameras A and B with overlapping fields of view. The plane passing through the camera centers of cameras A and B and the joint location (also referred to as a feature point) in the 3D scene is called the "epipolar plane". The intersection of the epipolar plane with the 2D image planes of the cameras A and B defines the "epipolar line". Given these corresponding points, a transformation is determined that can accurately map a corresponding point from camera A to an epipolar line in the field of view of camera B that is guaranteed to intersect the corresponding point in the image frame of camera B. Using the image frames collected above for a subject, the transformation is generated. It is known in the art that this transformation is non-linear. The general form is furthermore known to require compensation for the radial distortion of each camera's lens, as well as the non-linear coordinate transformation moving to and from the projected space. In external camera calibration, an approximation to the ideal non-linear transformation is determined by solving a non-linear optimization problem. This non-linear optimization function is used by the tracking engine 110 to identify the same joints in the outputs (arrays of joint data structures) of the different image recognition engines 112a-112n, processing images of the cameras 114 with overlapping fields of view. The results of the internal and external camera calibration are stored in the calibration database 170.
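The epipolar relationship can be illustrated with a short sketch. It covers only the ideal linear pinhole case, that is, after lens distortion has been compensated, and therefore is not the non-linear transformation the description refers to. The fundamental matrix values, the pixel coordinates, and the function names are hypothetical and chosen so that the result is easy to check by hand.

```python
import math


def epipolar_line(F, point_a):
    """Map a pixel (u, v) in camera A to line coefficients (a, b, c) in camera B.

    The line is F multiplied by the homogeneous pixel [u, v, 1]; a pixel
    (u', v') in camera B lies on it when a*u' + b*v' + c == 0.
    """
    u, v = point_a
    x = (u, v, 1.0)
    return tuple(sum(F[i][j] * x[j] for j in range(3)) for i in range(3))


def distance_to_line(line, point_b):
    """Perpendicular pixel distance from a point in camera B to the line."""
    a, b, c = line
    u, v = point_b
    return abs(a * u + b * v + c) / math.hypot(a, b)


# Hypothetical fundamental matrix for two identical, undistorted cameras that
# differ only by a sideways shift: epipolar lines are then horizontal rows.
F = [[0.0, 0.0, 0.0],
     [0.0, 0.0, -1.0],
     [0.0, 1.0, 0.0]]

line = epipolar_line(F, (400.0, 250.0))             # e.g. a left wrist in camera A
on_line = distance_to_line(line, (612.0, 250.0))    # candidate on the same row
off_line = distance_to_line(line, (612.0, 300.0))   # candidate on another row
```

A candidate joint in camera B whose distance to the line is small is a plausible match for the joint seen by camera A; a large distance rules the pairing out.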

A variety of techniques can be used to determine the relative positions of the points in the images of the cameras 114 in the real space. For example, Longuet-Higgins published "A computer algorithm for reconstructing a scene from two projections" in Nature, Volume 293, 10 September 1981. That paper presents the computation of the three-dimensional structure of a scene from a correlated pair of perspective projections when the spatial relationship between the two projections is unknown. The Longuet-Higgins paper presents a technique to determine the position of each camera in the real space with respect to the other cameras. Additionally, the technique allows triangulation of a subject in the real space, identifying the value of the z-coordinate (height from the floor) using images from cameras 114 with overlapping fields of view. An arbitrary point in the real space, for example, the end of a shelf in one corner of the real space, is designated as the (0, 0, 0) point on the (x, y, z) coordinate system of the real space.
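As a simple illustration of triangulation, the sketch below uses the midpoint method rather than the Longuet-Higgins algorithm. It assumes that the camera centers and the viewing rays toward the joint are already expressed in real-space coordinates; the camera positions, the ray directions, and the function names are hypothetical.

```python
def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def triangulate_midpoint(c1, d1, c2, d2):
    """Midpoint of the shortest segment between rays c1 + s*d1 and c2 + t*d2."""
    w = sub(c1, c2)
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("rays are parallel; cannot triangulate")
    s = (b * e - c * d) / denom   # parameter of the closest point on ray 1
    t = (a * e - b * d) / denom   # parameter of the closest point on ray 2
    p1 = tuple(ci + s * di for ci, di in zip(c1, d1))
    p2 = tuple(ci + t * di for ci, di in zip(c2, d2))
    return tuple((x + y) / 2 for x, y in zip(p1, p2))


# Hypothetical overhead cameras at a height of 3 units, both observing a joint.
camera_a, ray_a = (0.0, 0.0, 3.0), (2.0, 1.0, -2.0)
camera_b, ray_b = (4.0, 0.0, 3.0), (-2.0, 1.0, -2.0)

x, y, z = triangulate_midpoint(camera_a, ray_a, camera_b, ray_b)
```

The returned z value is the estimated height of the joint above the floor plane in this coordinate system.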

In an embodiment of the technology, the parameters of the external calibration are stored in two data structures. The first data structure stores intrinsic parameters. The intrinsic parameters represent a projective transformation from the 3D coordinates into 2D image coordinates. The first data structure contains intrinsic parameters per camera, as shown below. The data values are all numeric floating point numbers. This data structure stores a 3x3 intrinsic matrix, represented as "K", and distortion coefficients. The distortion coefficients include six radial distortion coefficients and two tangential distortion coefficients. Radial distortion occurs when light rays bend more near the edges of a lens than they do at its optical center. Tangential distortion occurs when the lens and the image plane are not parallel. The following data structure shows values for the first camera only. Similar data is stored for all the cameras 114.
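The listing itself is not reproduced in this text. A minimal sketch of a structure consistent with the description above might look as follows; the key names, the ordering of the coefficients, and every numeric value are assumptions made for illustration, and the projection helper deliberately ignores the distortion coefficients.

```python
# Placeholder values only; a real calibration would supply measured numbers.
intrinsics = {
    1: {  # camera identifier (hypothetical key)
        "K": [[1000.0, 0.0, 640.0],    # focal length x, skew, principal point x
              [0.0, 1000.0, 360.0],    # focal length y, principal point y
              [0.0, 0.0, 1.0]],
        # Six radial coefficients followed by two tangential ones (order assumed).
        "distortion_coefficients": [0.0] * 8,
    },
}


def check_intrinsics(entry):
    """Verify the shapes and value types described in the text."""
    k = entry["K"]
    coeffs = entry["distortion_coefficients"]
    return (len(k) == 3 and all(len(row) == 3 for row in k)
            and len(coeffs) == 8
            and all(isinstance(v, float) for row in k for v in row)
            and all(isinstance(v, float) for v in coeffs))


def project(entry, point):
    """Project a 3D point in camera coordinates to pixels, ignoring distortion."""
    x, y, z = point
    k = entry["K"]
    xn, yn = x / z, y / z                      # normalized image coordinates
    u = k[0][0] * xn + k[0][1] * yn + k[0][2]
    v = k[1][1] * yn + k[1][2]
    return (u, v)


pixel = project(intrinsics[1], (1.0, 0.5, 2.0))
```

With these placeholder values a point on the optical axis projects onto the principal point (640, 360).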

The second data structure stores, per pair of cameras: a 3x3 fundamental matrix (F), a 3x3 essential matrix (E), a 3x4 projection matrix (P), a 3x3 rotation matrix (R) and a 3x1 translation vector (t). This data is used to convert points in one camera's reference frame to another camera's reference frame. For each pair of cameras, eight homography coefficients are also stored to map the plane of the floor 220 from one camera to another. A fundamental matrix is a relationship between two images of the same scene that constrains where the projection of points from the scene can occur in both images. An essential matrix is also a relationship between two images of the same scene, with the condition that the cameras are calibrated. The projection matrix gives a vector space projection from the 3D real space to a subspace. The rotation matrix is used to perform a rotation in Euclidean space. The translation vector "t" represents a geometric transformation that moves every point of a figure or a space by the same distance in a given direction. The homography_floor_coefficients are used to combine images of features of subjects on the floor 220 viewed by cameras with overlapping fields of view. The second data structure is shown below. Similar data is stored for all pairs of cameras. As indicated previously, the x's represent numeric floating point numbers.

Network Configuration

FIG. 4 presents an architecture 400 of a network hosting image recognition engines. The system includes a plurality of network nodes 101a-101n in the illustrated embodiment. In such an embodiment, the network nodes are also referred to as processing platforms. The processing platforms 101a-101n and the cameras 412, 414, 416, ... 418 are connected to a network 481.

FIG. 4 shows a plurality of cameras 412, 414, 416, ... 418 connected to the network. A large number of cameras can be deployed in particular systems. In one embodiment, the cameras 412 to 418 are connected to the network 481 using Ethernet-based connectors 422, 424, 426, and 428, respectively. In such an embodiment, the Ethernet-based connectors have a data transfer speed of 1 gigabit per second, also referred to as Gigabit Ethernet. It is understood that in other embodiments, the cameras 114 are connected to the network using other types of network connections, which can have a faster or slower data transfer rate than Gigabit Ethernet. Also, in alternative embodiments, a set of cameras can be connected directly to each processing platform, and the processing platforms can be coupled to a network.
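As a rough check on why the connection speed matters, the following arithmetic estimates the bandwidth of one camera stream. The 1280x720 RGB frame size and the 30 FPS rate are taken from this description; eight bits per channel, uncompressed transmission, and the absence of protocol overhead are assumptions.

```python
WIDTH, HEIGHT, CHANNELS = 1280, 720, 3
BITS_PER_CHANNEL = 8            # assumption: one byte per color channel
FRAMES_PER_SECOND = 30
GIGABIT = 1_000_000_000         # bits per second on a Gigabit Ethernet link

bits_per_frame = WIDTH * HEIGHT * CHANNELS * BITS_PER_CHANNEL
bits_per_second = bits_per_frame * FRAMES_PER_SECOND
link_utilization = bits_per_second / GIGABIT
```

Under these assumptions a single uncompressed stream needs about 664 megabits per second, roughly two thirds of one Gigabit Ethernet link.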

The storage subsystem 430 stores the basic programming and data constructs that provide the functionality of certain embodiments of the present invention. For example, the various modules implementing the functionality of a plurality of image recognition engines may be stored in the storage subsystem 430. The storage subsystem 430 is an example of a computer readable memory comprising a non-transitory data storage medium, having computer instructions stored in the memory executable by a computer to perform all or any combination of the data processing and image processing functions described herein, including logic to identify changes in real space, to track subjects and to detect puts and takes of inventory items in an area of real space by processes as described herein. In other examples, the computer instructions can be stored in other types of memory, including portable memory, that comprise a non-transitory data storage medium or media readable by a computer.

These software modules are generally executed by a processor subsystem 450. A host memory subsystem 432 typically includes a number of memories, including a main random access memory (RAM) 434 for storage of instructions and data during program execution and a read-only memory (ROM) 436 in which fixed instructions are stored. In one embodiment, the RAM 434 is used as a buffer for storing video streams from the cameras 114 connected to the platform 101a.

A file storage subsystem 440 provides persistent storage for program and data files. In an example embodiment, the storage subsystem 440 includes four 120 gigabyte (GB) solid state disks (SSD) in a RAID 0 (redundant array of independent disks) arrangement, identified by the numeral 442. In the example embodiment, in which a CNN is used to identify joints of subjects, the RAID 0 442 is used to store training data. During training, the training data that is not in the RAM 434 is read from the RAID 0 442. Similarly, when images are being recorded for training purposes, the data that is not in the RAM 434 is stored in the RAID 0 442. In the example embodiment, the hard disk drive (HDD) 446 is a 10 terabyte storage. It is slower in access speed than the RAID 0 442 storage. The solid state disk (SSD) 444 contains the operating system and related files for the image recognition engine 112a.

In an example configuration, three cameras 412, 414, and 416 are connected to the processing platform 101a. Each camera has a dedicated graphics processing unit, GPU 1 462, GPU 2 464, and GPU 3 466, to process the images sent by the camera. It is understood that fewer than or more than three cameras can be connected per processing platform. Accordingly, fewer or more GPUs are configured in the network node so that each camera has a dedicated GPU for processing the image frames received from the camera. The processor subsystem 450, the storage subsystem 430 and the GPUs 462, 464, and 466 communicate using the bus subsystem 454.

A number of peripheral devices, such as a network interface subsystem, user interface output devices, and user interface input devices, are also connected to the bus subsystem 454 forming part of the processing platform 101a. These subsystems and devices are intentionally not shown in FIG. 4 to improve the clarity of the description. Although the bus subsystem 454 is shown schematically as a single bus, alternative embodiments of the bus subsystem may use multiple busses.

In one embodiment, the cameras 412 can be implemented using Chameleon3 1.3 MP Color USB3 Vision (Sony ICX445) cameras, having a resolution of 1288x964, a frame rate of 30 FPS, and 1.3 megapixels per image, with a varifocal lens having a working distance (mm) of 300 - ∞ and a field of view, with a 1/3" sensor, of 98.2° - 23.8°.

Convolutional Neural Network

The image recognition engines in the processing platforms receive a continuous stream of images at a predetermined rate. In one embodiment, the image recognition engines comprise convolutional neural networks (abbreviated CNN).

FIG. 5 illustrates the processing of image frames by a CNN, referred to by the numeral 500. The input image 510 is a matrix consisting of image pixels arranged in rows and columns. In one embodiment, the input image 510 has a width of 1280 pixels, a height of 720 pixels and 3 channels: red, blue, and green (also referred to as RGB). The channels can be imagined as three 1280x720 two-dimensional images stacked over one another. Therefore, the input image has dimensions of 1280x720x3 as shown in FIG. 5.

A 2x2 filter 520 is convolved with the input image 510. In this embodiment, no padding is applied when the filter is convolved with the input. Following this, a nonlinearity function is applied to the convolved image. In the present embodiment, rectified linear unit (ReLU) activations are used. Other examples of nonlinear functions include sigmoid, hyperbolic tangent (tanh) and variations of ReLU such as leaky ReLU. A search is performed to find hyperparameter values. The hyperparameters are C_1, C_2, ..., C_N, where C_N means the number of channels for the convolution layer "N". Typical values of N and C are shown in FIG. 5. There are twenty-five (25) layers in the CNN, as represented by N equal to 25. The values of C are the number of channels in each convolution layer for layers 1 to 25. In other embodiments, additional features are added to the CNN 500, such as residual connections, squeeze-excitation modules, and multiple resolutions.
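For reference, the nonlinearities named above can be sketched in a few lines. The leaky ReLU slope of 0.01 is a commonly used value and an assumption here, since the description does not give one; the small feature map is likewise hypothetical.

```python
import math


def relu(x):
    """Rectified linear unit: negative inputs become zero."""
    return max(0.0, x)


def leaky_relu(x, slope=0.01):
    """ReLU variant that keeps a small slope for negative inputs (slope assumed)."""
    return x if x > 0 else slope * x


def sigmoid(x):
    """Logistic function, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


# A hypothetical 2x2 patch of a convolved image, activated element by element.
feature_map = [[-2.0, 0.5],
               [3.0, -0.1]]
activated = [[relu(v) for v in row] for row in feature_map]
```

The hyperbolic tangent is available directly as math.tanh and is applied element by element in the same way.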

In typical CNNs used for image classification, the size of the image (width and height dimensions) is reduced as the image is processed through the convolution layers. That is helpful in feature identification, as the goal is to predict a class for the input image. However, in the illustrated embodiment, the size of the input image (i.e., the image width and height dimensions) is not reduced, as the goal is not only to identify a joint (also referred to as a feature) in the image frame, but also to identify its location in the image so that it can be mapped to coordinates in the real space. Therefore, as shown in FIG. 5, the width and height dimensions of the image remain unchanged in this example as the processing proceeds through the convolution layers of the CNN.

In one embodiment, the CNN 500 identifies one of the 19 possible joints of the subjects at each element of the image. The possible joints can be grouped in two categories: foot joints and non-foot joints. The 19th type of joint classification is for all non-joint features of the subject (i.e., elements of the image not classified as a joint).

Foot joints:
    Ankle joint (left and right)
Non-foot joints:
    Neck
    Nose
    Eyes (left and right)
    Ears (left and right)
    Shoulders (left and right)
    Elbows (left and right)
    Wrists (left and right)
    Hips (left and right)
    Knees (left and right)
Not a joint
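The 19 classes listed above can be written down as an enumeration. The member names and the integer values assigned to them are assumptions made for illustration; the description does not state how the classes are numbered.

```python
from enum import IntEnum


class JointType(IntEnum):
    """The 19 classes from the list above; the numbering is hypothetical."""
    LEFT_ANKLE = 0
    RIGHT_ANKLE = 1
    NECK = 2
    NOSE = 3
    LEFT_EYE = 4
    RIGHT_EYE = 5
    LEFT_EAR = 6
    RIGHT_EAR = 7
    LEFT_SHOULDER = 8
    RIGHT_SHOULDER = 9
    LEFT_ELBOW = 10
    RIGHT_ELBOW = 11
    LEFT_WRIST = 12
    RIGHT_WRIST = 13
    LEFT_HIP = 14
    RIGHT_HIP = 15
    LEFT_KNEE = 16
    RIGHT_KNEE = 17
    NOT_A_JOINT = 18


FOOT_JOINTS = frozenset({JointType.LEFT_ANKLE, JointType.RIGHT_ANKLE})


def is_foot_joint(joint: JointType) -> bool:
    """True for the ankle classes, which form the foot-joint category."""
    return joint in FOOT_JOINTS
```

Counting the members confirms the total: 2 foot joints, 16 non-foot joints, and the single "not a joint" class.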

As can be seen, for the purposes of this description a "joint" is a trackable feature of a subject in the real space. A joint may correspond to a physiological joint on the subject, or to another feature such as an eye or the nose.

The first set of analyses on the stream of input images identifies trackable features of subjects in real space. In one embodiment, this is referred to as "joints analysis". In such an embodiment, the CNN used for joints analysis is referred to as the "joints CNN". In one embodiment, the joints analysis is performed thirty times per second, over the thirty frames per second received from the corresponding camera. The analysis is synchronized in time (i.e., at every 1/30th of a second): images from all cameras 114 are analyzed in the corresponding joints CNNs to identify the joints of all subjects in the real space. The results of this analysis of the images from a single moment in time, across the plurality of cameras, are stored as a "snapshot".

A snapshot can take the form of a dictionary containing the arrays of joint data structures from the images of all cameras 114 at a moment in time, representing a constellation of candidate joints within the area of real space covered by the system. In one embodiment, the snapshot is stored in the subject database 140.

In this example CNN, a softmax function is applied to every element of the image in the final layer of the convolution layers 530. The softmax function transforms a K-dimensional vector of arbitrary real values into a K-dimensional vector of real values in the range [0, 1] that sum to 1. In one embodiment, an element of an image is a single pixel. The softmax function converts the 19-dimensional array (also referred to as a 19-dimensional vector) of arbitrary real values for each pixel into a 19-dimensional confidence array of real values in the range [0, 1] that sum to 1. The 19 dimensions of a pixel in the image frame correspond to the 19 channels in the final layer of the CNN, which in turn correspond to the 19 types of joints of the subjects.
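The per-pixel softmax described above can be sketched as follows. This is a minimal illustration assuming NumPy and a final-layer output laid out as height x width x 19 channels; the function name `pixelwise_softmax` and that array layout are assumptions, not details from the specification.

```python
import numpy as np

def pixelwise_softmax(logits):
    """Convert an (H, W, K) array of arbitrary real values into an
    (H, W, K) array of confidences in [0, 1] that sum to 1 over K.

    Subtracting the per-pixel maximum first avoids overflow in exp()
    and does not change the result.
    """
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Hypothetical final-layer output for a 4x5 image with 19 channels.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 5, 19))
confidence = pixelwise_softmax(logits)  # one 19-dimensional confidence array per pixel
```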

A large number of picture elements in one image can be classified as each of the 19 types of joints, depending on the number of subjects in the field of view of the source camera for that image.

The image recognition engines 112a-112n process images to generate confidence arrays for the elements of the image. The confidence array for a particular element of an image includes confidence values for a plurality of joint types for that element. Each of the image recognition engines 112a-112n generates an output matrix 540 of confidence arrays per image. Finally, each image recognition engine generates an array of joint data structures corresponding to each output matrix 540 of confidence arrays per image. The array of joint data structures corresponding to a particular image classifies elements of the particular image by joint type, time of the particular image, and coordinates of the element in the particular image. The joint type for the joint data structure of a particular element in each image is selected based on the values of the confidence array.

Each joint of the subjects can be considered to be distributed in the output matrix 540 as a heat map. The heat map can be resolved to show the image elements having the highest values (peaks) for each joint type. Ideally, for a given picture element with a high value for a particular joint type, the surrounding picture elements beyond a certain distance from it will have lower values for that joint type, so that the location of a particular joint of that type can be identified in image space coordinates. Correspondingly, the confidence array for that image element will have the highest confidence value for that joint and lower confidence values for the remaining 18 types of joints.

In one embodiment, batches of images from each camera 114 are processed by the respective image recognition engines. For example, six contiguous time-stamped images are processed sequentially in a batch to take advantage of cache coherence. The parameters for one layer of the CNN 500 are loaded into memory and applied to the batch of six image frames. The parameters for the next layer are then loaded into memory and applied to the batch of six images. This is repeated for all convolution layers 530 in the CNN 500. Cache coherence reduces processing time and improves the performance of the image recognition engines.

In one such embodiment, using three-dimensional (3D) convolution, a further improvement in the performance of the CNN 500 is achieved by sharing information across the image frames in the batch. This helps identify joints more precisely and reduces false positives. For example, features in the image frames whose pixel values do not change across the multiple image frames in a given batch are likely to be static objects such as a shelf. A change in the value of the same pixel across the image frames in a given batch indicates that the pixel is likely to be a joint. The CNN 500 can therefore focus more on processing that pixel to correctly identify the joint it represents.

Joint data structure

The output of the CNN 500 is a matrix of confidence arrays for each image from each camera. The matrix of confidence arrays is transformed into an array of joint data structures. A joint data structure, as shown in FIG. 6, is used to store the information of each joint. The joint data structure 600 identifies the x and y positions of the element in the particular image, in the 2D image space of the camera from which the image was received. A joint number identifies the type of joint identified. For example, in one embodiment, the values range from 1 to 19. A value of 1 indicates that the joint is a left ankle, a value of 2 indicates that the joint is a right ankle, and so on. The type of joint is selected using the confidence array for that element in the output matrix 540. For example, in one embodiment, if the value corresponding to the left-ankle joint is the highest in the confidence array for that image element, then the value of the joint number is "1".

A confidence number indicates the degree of confidence of the CNN 500 in predicting that joint. If the value of the confidence number is high, the CNN is confident in its prediction. An integer Id is assigned to the joint data structure to identify it uniquely. Following the above mapping, the output matrix 540 of confidence arrays per image is converted into an array of joint data structures for each image.
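The mapping from an output matrix of confidence arrays to an array of joint data structures can be sketched as below. This is only an illustration under stated assumptions: the class name `Joint`, its field names, the function `to_joint_array`, and the choice to drop elements whose best class is "not a joint" (assumed here to be joint number 19) are all hypothetical and are not specified in the text.

```python
from dataclasses import dataclass
from itertools import count

import numpy as np

NOT_A_JOINT = 19  # assumption: the 19th class is numbered 19

@dataclass
class Joint:
    x: int             # column of the element in 2D image space
    y: int             # row of the element in 2D image space
    joint_number: int  # 1..19; 1 = left ankle, 2 = right ankle, ...
    confidence: float  # confidence value of the selected joint type
    joint_id: int      # integer Id that uniquely identifies this structure

def to_joint_array(confidences, skip_class=NOT_A_JOINT):
    """Turn an (H, W, 19) matrix of confidence arrays into a list of Joint.

    The joint type of each element is the channel with the highest
    confidence. Elements whose best class equals skip_class are dropped
    (an assumption made for this sketch).
    """
    ids = count(1)
    best = confidences.argmax(axis=-1)  # 0-based channel index per element
    joints = []
    for y, x in zip(*np.nonzero(best + 1 != skip_class)):
        channel = int(best[y, x])
        joints.append(Joint(int(x), int(y), channel + 1,
                            float(confidences[y, x, channel]), next(ids)))
    return joints
```

Taking the arg-max per element mirrors the selection rule described above, where the joint number is "1" when the left-ankle value is the highest in the confidence array.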

The image recognition engines 112a-112n receive the sequences of images from the cameras 114 and process the images to generate corresponding arrays of joint data structures as described above. The array of joint data structures for a particular image classifies elements of the particular image by joint type, time of the particular image, and coordinates of the element in the particular image. In one embodiment, the image recognition engines 112a-112n are convolutional neural networks CNN 500, the joint type is one of the 19 types of joints of the subjects, the time of the particular image is the time stamp generated by the source camera 114 for that image, and the coordinates (x, y) identify the position of the element on the 2D image plane.

In one embodiment, the joints analysis includes performing a combination of k-nearest neighbors, mixture of Gaussians, various image morphology transformations, and the joints CNN on each input image. The result comprises arrays of joint data structures, which can be stored in the form of a bit mask in a ring buffer that maps image numbers to bit masks at each moment in time.

Tracking engine

The tracking engine 110 is configured to receive the arrays of joint data structures generated by the image recognition engines 112a-112n, corresponding to images in the sequences of images from cameras having overlapping fields of view. The arrays of joint data structures per image are sent by the image recognition engines 112a-112n to the tracking engine 110 via the network 181, as shown in FIG. 7. The tracking engine 110 translates the coordinates of the elements in the arrays of joint data structures corresponding to images in different sequences into candidate joints having coordinates in the real space. The tracking engine 110 comprises logic to identify sets of candidate joints having coordinates in real space (constellations of joints) as subjects in the real space. In one embodiment, the tracking engine 110 accumulates the arrays of joint data structures from the image recognition engines for all the cameras at a given moment in time and stores this information as a dictionary in the subject database 140, to be used for identifying a constellation of candidate joints. The dictionary can be arranged in the form of key-value pairs, where the keys are camera ids and the values are arrays of joint data structures from the camera. In such an embodiment, this dictionary is used in heuristics-based analysis to determine candidate joints and the assignment of joints to subjects. In such an embodiment, the high-level input, processing, and output of the tracking engine 110 are illustrated in Table 1.

Table 1: Inputs, processing, and outputs of the tracking engine 110 in an example embodiment.

Grouping joints into candidate joints

The tracking engine 110 receives arrays of joint data structures along two dimensions: time and space. Along the time dimension, the tracking engine sequentially receives time-stamped arrays of joint data structures processed by the image recognition engines 112a-112n for each camera. The joint data structures include multiple instances of the same joint of the same subject over a period of time, in images from cameras having overlapping fields of view. The (x, y) coordinates of the element in a particular image will usually differ across sequentially time-stamped arrays of joint data structures because of the movement of the subject to which the joint belongs. For example, twenty picture elements classified as left-wrist joints can appear in many sequentially time-stamped images from a particular camera, each left-wrist joint having a position in real space that may change or stay the same from image to image. As a result, twenty left-wrist joint data structures 600 in many sequentially time-stamped arrays of joint data structures can represent the same twenty joints in real space over time.

Because multiple cameras having overlapping fields of view cover each location in the real space, at any given moment in time the same joint can appear in the images of more than one of the cameras 114. The cameras 114 are synchronized in time; therefore, at any given moment the tracking engine 110 receives joint data structures for a particular joint from multiple cameras having overlapping fields of view. This is the space dimension, the second of the two dimensions (time and space) along which the tracking engine 110 receives the data in the arrays of joint data structures.

The tracking engine 110 uses an initial set of heuristics stored in the heuristics database 160 to identify candidate joint data structures from the arrays of joint data structures. The goal is to minimize a global metric over a period of time. A global metric calculator 702 calculates the global metric. The global metric is a summation of multiple values described below. Intuitively, the value of the global metric is minimal when the joints in the arrays of joint data structures received by the tracking engine 110 along the time and space dimensions are correctly assigned to their respective subjects. For example, consider the embodiment of a shopping store with customers moving in the aisles. If the left wrist of customer A is incorrectly assigned to customer B, the value of the global metric will increase. Minimizing the global metric for each joint of each customer is therefore an optimization problem. One option to solve this problem is to try all possible connections of joints. However, this can become intractable as the number of customers increases.
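Why trying all possible connections becomes intractable can be seen in a small sketch. It pairs candidate left ankles with candidate right ankles by exhaustive search; the cost used here (summed Euclidean distance between paired ankles) is only a stand-in for the global metric, and the function name `best_pairing` is hypothetical.

```python
from itertools import permutations
from math import dist, factorial

def best_pairing(left_ankles, right_ankles):
    """Exhaustively pair each left ankle with one right ankle.

    Returns (pairing, cost), where pairing[i] is the index of the right
    ankle paired with left ankle i. The cost is a placeholder for the
    global metric: the sum of distances between paired ankles.
    """
    best_cost, best_perm = float("inf"), None
    for perm in permutations(range(len(right_ankles))):
        cost = sum(dist(left_ankles[i], right_ankles[j])
                   for i, j in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return best_perm, best_cost

# Two hypothetical customers standing 10 m apart on the floor plane.
left = [(0.0, 0.0), (10.0, 0.0)]
right = [(10.3, 0.0), (0.3, 0.0)]
pairing, cost = best_pairing(left, right)

# The number of pairings examined grows factorially with the head count.
pairings_for_ten_customers = factorial(10)
```

With n candidates of each type the loop visits n! pairings, and that is before any other joint types are considered, which is the motivation for the heuristics that follow.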

A second approach to solving this problem is to use heuristics to reduce the possible combinations of joints identified as members of a set of candidate joints for a single subject. For example, a left-wrist joint cannot belong to a subject whose other joints are far away in space, because of the known physiological characteristics of the relative positions of joints. Similarly, a left-wrist joint having a small change in position from image to image is less likely to belong to a subject having the same joint at the same position in an image far apart in time, because subjects are not expected to move at very high speed. These initial heuristics are used to build boundaries in time and space for constellations of candidate joints that can be classified as a particular subject. The joints in the joint data structures within a particular time and space boundary are considered "candidate joints" for assignment to sets of candidate joints as subjects present in the real space. These candidate joints include joints identified in the arrays of joint data structures from multiple images from the same camera over a period of time (time dimension) and across different cameras with overlapping fields of view (space dimension).

Foot joints

For the purposes of a procedure for grouping joints into constellations, the joints can be divided into foot and non-foot joints, as shown above in the list of joints. In the current example, the left and right ankle joint types are considered foot joints for the purposes of this procedure. The tracking engine 110 can start identifying the sets of candidate joints of particular subjects using the foot joints. In the embodiment of the shopping store, the feet of the customers are on the floor 220, as shown in FIG. 2. The distance of the cameras 114 to the floor 220 is known. Therefore, when combining the joint data structures of foot joints from the arrays of joint data structures corresponding to images from cameras with overlapping fields of view, the tracking engine 110 can assume a known depth (distance along the z axis). The depth value of the foot joints is zero, i.e., (x, y, 0) in the (x, y, z) coordinate system of the real space. Using this information, the tracking engine 110 applies homographic mapping to combine the joint data structures of foot joints from cameras with overlapping fields of view and identify the candidate foot joints. Using this mapping, the location of a joint in (x, y) coordinates in image space is converted to a location in (x, y, z) coordinates in the real space, resulting in a candidate foot joint. This process is performed separately to identify candidate left and right foot joints using the respective joint data structures.
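The homographic mapping of a foot joint onto the floor plane can be sketched as follows, assuming NumPy and a 3x3 homography matrix per camera obtained during calibration. The function name `image_to_floor` and the example matrix are hypothetical; they only illustrate the projective arithmetic.

```python
import numpy as np

def image_to_floor(H, x, y):
    """Map an image-space point (x, y) to real-space (X, Y, 0).

    H is a 3x3 homography from the camera's image plane to the floor
    plane. The point is lifted to homogeneous coordinates, multiplied
    by H, and divided by the third component. The depth is fixed at
    zero because foot joints are assumed to be on the floor.
    """
    p = H @ np.array([x, y, 1.0])
    return np.array([p[0] / p[2], p[1] / p[2], 0.0])

# Hypothetical homography: 1 pixel = 1 cm, with the floor origin
# appearing at pixel (100, 200).
H_cam = np.array([[0.01, 0.0, -1.0],
                  [0.0, 0.01, -2.0],
                  [0.0, 0.0, 1.0]])
left_ankle_on_floor = image_to_floor(H_cam, 300, 200)
```

Applying each camera's own homography to its left-ankle detections places them in one shared floor coordinate system, where detections of the same ankle from different cameras should land close together.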

Following this, the tracking engine 110 can combine a candidate left foot joint and a candidate right foot joint (assigning them to a set of candidate joints) to create a subject. Other joints from the galaxy of candidate joints can be linked to the subject to build a constellation of some or all of the joint types for the created subject.

If there is only one left candidate foot joint and one right candidate foot joint, then there is only one subject in the particular space at the particular time. The tracking engine 110 creates a new subject having the left and right candidate foot joints in its set of joints. The subject is saved in the subject database 140. If there are multiple candidate left and right foot joints, then the global metric calculator 702 attempts to combine each candidate left foot joint with each candidate right foot joint to create subjects such that the value of the global metric is minimized.

Non-foot joints

To identify candidate non-foot joints from the arrays of joint data structures within a particular time and space boundary, the tracking engine 110 uses the non-linear transformation (also referred to as a fundamental matrix) from any given camera A to its neighboring camera B with an overlapping field of view. The non-linear transformations are calculated using a single multi-joint subject and stored in the calibration database 170, as described above. For example, for two cameras A and B with overlapping fields of view, the candidate non-foot joints are identified as follows. The non-foot joints in the array of joint data structures corresponding to elements in image frames from camera A are mapped to epipolar lines in the synchronized image frames from camera B. A joint (also referred to as a feature in the machine vision literature) identified by a joint data structure in the array for a particular image of camera A will appear on the corresponding epipolar line if it appears in the image of camera B. For example, if the joint in the joint data structure from camera A is a left-wrist joint, then a left-wrist joint on the epipolar line in the image of camera B represents the same left-wrist joint from the perspective of camera B. These two points in the images of cameras A and B are projections of the same point in the 3D scene in real space and are referred to as a "conjugate pair".
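The epipolar-line test for a conjugate pair can be sketched as below, assuming NumPy and the common convention that a point x_A in camera A and its match x_B in camera B satisfy x_B^T F x_A = 0 in homogeneous coordinates. The function names and the example matrix (the fundamental matrix of an idealized pair of cameras offset purely sideways) are illustrative assumptions, not calibration values from the system.

```python
import numpy as np

def epipolar_line(F, point_a):
    """Return (a, b, c) such that a*x + b*y + c = 0 is the line in
    camera B's image on which the match of point_a must lie."""
    x, y = point_a
    return F @ np.array([x, y, 1.0])

def point_line_distance(line, point_b):
    """Perpendicular 2D distance from a point in B's image to the line."""
    a, b, c = line
    x, y = point_b
    return abs(a * x + b * y + c) / np.hypot(a, b)

# Idealized example: cameras displaced only along the image x axis,
# so each epipolar line is the image row with the same y coordinate.
F_ab = np.array([[0.0, 0.0, 0.0],
                 [0.0, 0.0, -1.0],
                 [0.0, 1.0, 0.0]])
line = epipolar_line(F_ab, (30.0, 120.0))      # left wrist seen by camera A
on_line = point_line_distance(line, (50.0, 120.0))
off_line = point_line_distance(line, (50.0, 125.0))
```

A left-wrist detection in camera B whose distance to the line is near zero is a plausible conjugate of the detection in camera A; a detection several pixels off the line is not.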

Machine vision techniques, such as the technique published by Longuet-Higgins in the paper titled "A computer algorithm for reconstructing a scene from two projections" (Nature, Volume 293, 10 September 1981), are applied to the conjugate pairs of corresponding points to determine the height of joints above the floor 220 in the real space. Application of this method requires a predetermined mapping between cameras with overlapping fields of view. That data is stored in the calibration database 170 as the non-linear functions determined during the calibration of the cameras 114 described above.

The tracking engine 110 receives the arrays of joint data structures corresponding to images in the sequences of images from cameras having overlapping fields of view, and translates the coordinates of the elements in the arrays of joint data structures corresponding to images in different sequences into candidate non-foot joints having coordinates in the real space. The identified candidate non-foot joints are grouped into sets of subjects having coordinates in real space using the global metric calculator 702. The global metric calculator 702 calculates the value of the global metric and attempts to minimize it by checking different combinations of non-foot joints. In one embodiment, the global metric is a sum of heuristics organized in four categories. The logic to identify sets of candidate joints comprises heuristic functions, based on physical relationships among the joints of subjects in real space, to identify sets of candidate joints as subjects. Examples of physical relationships among joints are considered in the heuristics described below.

First category of heuristics

The first category of heuristics includes metrics to ascertain the similarity between two proposed subject-joint locations in the same camera view at the same or different moments in time. In one embodiment, these metrics are floating point values, where higher values mean that two lists of joints are likely to belong to the same subject. Considering the example embodiment of the shopping store, the metrics determine the distance between a customer's same joints in one camera from one image to the next along the time dimension. Given a person A in the field of view of the camera 412, the first set of metrics determines the distance between each of person A's joints from one image from the camera 412 to the next image from the camera 412. The metrics are applied to the joint data structures 600 in the arrays of joint data structures per image from the cameras 114.

In one embodiment, two example metrics in the first category of heuristics are listed below:
    1. The inverse of the Euclidean 2D coordinate distance (using the x, y coordinate values for a particular image from a particular camera) between the left ankle joints of two subjects on the floor and between the right ankle joints of the two subjects on the floor, summed together.
    2. The sum of the inverse of the Euclidean 2D coordinate distance between every pair of non-foot joints of subjects in the image frame.

Second category of heuristics

The second category of heuristics includes metrics to ascertain the similarity between two proposed subject-joint locations from the fields of view of multiple cameras at the same moment in time. In one embodiment, these metrics are floating point values, where higher values mean that two lists of joints are likely to belong to the same subject. Considering the example embodiment of the shopping store, the second set of metrics determines the distance between a customer's same joints in image frames from two or more cameras (with overlapping fields of view) at the same moment in time.

In one embodiment, two example metrics in the second category of heuristics are listed below:
    1. The inverse of the Euclidean 2D coordinate distance (using the x, y coordinate values for a particular image from a particular camera) between the left ankle joints of two subjects on the floor and between the right ankle joints of the two subjects on the floor, summed together. The ankle joint locations of the first subject are projected, through homographic mapping, to the camera in which the second subject is visible.
    2. The sum, over all pairs of joints, of the inverse of the Euclidean 2D coordinate distance between a line and a point, where the line is the epipolar line of a joint of an image from a first camera having a first subject in its field of view to a second camera having a second subject in its field of view, and the point is the joint of the second subject in the image from the second camera.

Third category of heuristics

The third category of heuristics includes metrics to ascertain the similarity between all joints of a proposed subject-joint location in the same camera view at the same moment in time. Considering the example embodiment of the shopping store, this category of metrics determines the distance between the joints of a customer in one frame from one camera.

Fourth category of heuristics

The fourth category of heuristics includes metrics to ascertain the dissimilarity between proposed subject-joint locations. In one embodiment, these metrics are floating point values. Higher values mean that two lists of joints are more likely not to be the same subject. In one embodiment, two example metrics in this category include:
    1. The distance between the neck joints of two proposed subjects.
    2. The sum of the distances between pairs of joints between two subjects.
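The two dissimilarity metrics listed above can be sketched as follows. The representation of a proposed subject as a dictionary from joint name to real-space coordinates, the joint names, and the reading of "pairs of joints" as same-type joints present in both subjects are assumptions made for this illustration.

```python
from math import dist

def neck_distance(subject_a, subject_b):
    """Metric 1: distance between the neck joints of two proposed subjects."""
    return dist(subject_a["neck"], subject_b["neck"])

def summed_joint_distance(subject_a, subject_b):
    """Metric 2: sum of distances between same-type joints of two subjects.

    Only joint types present in both subjects contribute (an assumption).
    """
    shared = subject_a.keys() & subject_b.keys()
    return sum(dist(subject_a[name], subject_b[name]) for name in shared)

# Hypothetical proposed subjects, coordinates in metres.
a = {"neck": (0.0, 0.0, 1.5), "left_wrist": (0.5, 0.0, 1.0)}
b = {"neck": (3.0, 4.0, 1.5), "left_wrist": (3.5, 4.0, 1.0),
     "nose": (3.0, 4.0, 1.7)}
d_neck = neck_distance(a, b)
d_sum = summed_joint_distance(a, b)
```

Larger values of either metric suggest that the two joint lists belong to different subjects.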

In one embodiment, various thresholds, which can be determined empirically, are applied to the metrics listed above as follows:
    1. Thresholds to decide when metric values are small enough to consider that a joint belongs to a known subject.
    2. Thresholds to determine when there are too many potential candidate subjects to which a joint could belong with too good a metric similarity score.
    3. Thresholds to determine when a collection of joints has high enough metric similarity to be considered a new subject, not previously present in the real space.
    4. Thresholds to determine when a subject is no longer in the real space.
    5. Thresholds to determine when the tracking engine 110 has made a mistake and has confused two subjects.

The tracking engine 110 includes logic to store the sets of joints identified as subjects. The logic to identify sets of candidate joints includes logic to determine whether a candidate joint identified in images taken at a particular time corresponds to a member of one of the sets of candidate joints identified as subjects in preceding images. In one embodiment, the tracking engine 110 compares the current joint locations of a subject with previously recorded joint locations of the same subject at regular intervals. This comparison allows the tracking engine 110 to update the joint locations of subjects in the real space. Additionally, using this approach, the tracking engine 110 identifies false positives (i.e., falsely identified subjects) and removes subjects no longer present in the real space.

Consider the example of the shopping store embodiment, in which the tracking engine 110 created a customer (subject) at an earlier moment in time but, after some time, no longer has current joint locations for that particular customer. This means that the customer was created incorrectly. The tracking engine 110 deletes incorrectly created subjects from the subject database 140. In one embodiment, the tracking engine 110 also removes positively identified subjects from the real space using the above process. In the shopping store example, when a customer leaves the store, the tracking engine 110 deletes the corresponding customer record from the subject database 140. In one such embodiment, the tracking engine 110 instead updates this customer's record in the subject database 140 to indicate that the "customer has left the store".

In one embodiment, the tracking engine 110 attempts to identify subjects by applying the foot and non-foot heuristics simultaneously. This results in "islands" of connected joints of the subjects. As the tracking engine 110 processes further arrays of joint data structures along the time and space dimensions, the islands grow in size. Eventually, islands of joints merge with other islands of joints to form subjects, which are then stored in the subject database 140. In one embodiment, the tracking engine 110 maintains a record of unassigned joints for a predetermined period of time. During this time, the tracking engine attempts to assign the unassigned joints to existing subjects or to create new multi-joint entities from them. The tracking engine 110 discards the unassigned joints after the predetermined period of time. It is understood that, in other embodiments, heuristics different from those listed above are used to identify and track subjects.
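
The handling of unassigned joints can be illustrated with a small holding pool that retries assignment and discards joints once the predetermined period has passed. This is a sketch under assumptions: the class name `UnassignedJointPool`, the `try_assign` callback, and the use of seconds as the time unit are all hypothetical, and the matching heuristics themselves are not shown.

```python
class UnassignedJointPool:
    """Holds joints that could not yet be assigned to any subject."""

    def __init__(self, ttl_seconds):
        self.ttl_seconds = ttl_seconds
        self._pending = []  # list of (first_seen_time, joint)

    def add(self, joint, now):
        self._pending.append((now, joint))

    def retry(self, try_assign, now):
        """Retry assignment for every pending joint.

        try_assign(joint) returns True when the joint was attached to an
        existing subject or used to form a new multi-joint entity.
        Returns the joints discarded because their time-to-live expired.
        """
        still_pending, discarded = [], []
        for first_seen, joint in self._pending:
            if try_assign(joint):
                continue  # assigned, so no longer pending
            if now - first_seen >= self.ttl_seconds:
                discarded.append(joint)
            else:
                still_pending.append((first_seen, joint))
        self._pending = still_pending
        return discarded

    def __len__(self):
        return len(self._pending)
```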

In one embodiment, a user interface output device connected to the node 102 hosting the tracking engine 110 displays the position of each subject in the real space. In one such embodiment, the display of the output device is refreshed at regular intervals with the new positions of the subjects.

Subject Data Structure

The joints of the subjects are connected to each other using the metrics described above. In doing so, the tracking engine 110 creates new subjects and updates the positions of existing subjects by updating their respective joint locations. FIG. 8 shows the subject data structure 800 used to store a subject. The data structure 800 stores subject-related data as a key-value dictionary. The key is a frame_number and the value is another key-value dictionary, in which the key is the camera_id and the value is a list of the 18 joints (of the subject) with their locations in the real space. The subject data is stored in the subject database 140. Each new subject is also assigned a unique identifier that is used to access the subject's data in the subject database 140.
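
The nested dictionary described above might look like the following in Python. This is an illustrative sketch only: the identifier format, the camera name, the joint coordinates, and the helper names `new_subject_record` and `store_joints` are made up for the example, and an in-memory dict stands in for the subject database 140.

```python
from collections import defaultdict

NUM_JOINTS = 18  # joints per subject, as stated in the text


def new_subject_record():
    """frame_number -> {camera_id -> list of 18 joint locations}."""
    return defaultdict(dict)


def store_joints(record, frame_number, camera_id, joints):
    """Store one camera's list of joints for one frame."""
    if len(joints) != NUM_JOINTS:
        raise ValueError(f"expected {NUM_JOINTS} joints, got {len(joints)}")
    record[frame_number][camera_id] = list(joints)


# Stand-in for the subject database, keyed by the subject's unique identifier.
subject_database = {}
subject_id = "subject-0001"          # hypothetical identifier format
subject_database[subject_id] = new_subject_record()

# Made-up (x, y, z) real-space locations for the 18 joints.
example_joints = [(0.1 * i, 0.2 * i, 1.0) for i in range(NUM_JOINTS)]
store_joints(subject_database[subject_id], 42, "cam_03", example_joints)
```

A lookup then follows the key order given in the text: subject identifier, then frame number, then camera identifier.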

In one embodiment, the system identifies the joints of a subject and creates a skeleton of the subject. The skeleton is projected into the real space to indicate the position and orientation of the subject in the real space. In the field of machine vision, this is also referred to as "pose estimation." In one embodiment, the system displays the orientations and positions of subjects in the real space on a graphical user interface (GUI). In one embodiment, the image analysis is anonymous; that is, the unique identifier assigned to a subject created through joint analysis does not identify personal identification details (such as name, email address, mailing address, credit card number, bank account number, or driver's license number) of any specific subject in the real space.

Process Flow of Subject Tracking

A number of flowcharts illustrating logic are described herein. The logic can be implemented using processors configured as described above, programmed using computer programs stored in memory accessible to and executable by the processors, and, in other configurations, by dedicated logic hardware (including field programmable integrated circuits) or by combinations of dedicated logic hardware and computer programs. With all flowcharts herein, it will be appreciated that many of the steps can be combined, performed in parallel, or performed in a different sequence without affecting the functions achieved. In some cases, as the reader will appreciate, a rearrangement of steps achieves the same results only if certain other changes are made as well. In other cases, as the reader will appreciate, a rearrangement of steps achieves the same results only if certain conditions are satisfied. Furthermore, it will be appreciated that the flowcharts herein show only the steps pertinent to an understanding of the embodiments, and that numerous additional steps for accomplishing other functions can be performed before, after, and between those shown.

FIG. 9 is a flowchart illustrating the process steps for tracking subjects. The process starts at step 902. The cameras 114, which have fields of view covering an area of the real space, are calibrated at process step 904. Video processes are performed by the image recognition engines 112a-112n at step 906. In one embodiment, a video process is performed per camera to process batches of image frames received from the respective camera. The outputs of all video processes from the respective image recognition engines 112a-112n are given as input to a scene process performed by the tracking engine 110 at step 908. The scene process identifies new subjects and updates the joint locations of existing subjects. At step 910, it is checked whether there are more image frames to be processed. If there are more image frames, the process continues at step 906; otherwise the process ends at step 914.

More detailed process steps of the process step 904, "calibrate cameras in real space," are presented in the flowchart of FIG. 10. The calibration process starts at step 1002 by identifying the (0, 0, 0) point of the (x, y, z) coordinates of the real space. At step 1004, a first camera with the location (0, 0, 0) in its field of view is calibrated. More details of camera calibration are presented earlier in this application. At step 1006, a next camera with a field of view overlapping that of the first camera is calibrated. At step 1008, it is checked whether there are more cameras to calibrate. The process is repeated at step 1006 until all cameras 114 are calibrated.

In the next process step 1010, a subject is introduced into the real space to identify conjugate pairs of corresponding points between cameras with overlapping fields of view. Some details of this process are described above. The process is repeated for every pair of overlapping cameras at step 1012. The process ends if there are no more cameras (step 1014).

The flowchart in FIG. 11 shows more detailed steps of the "video process" step 906. At step 1102, k contiguously timestamped images per camera are selected as a batch for further processing. In one embodiment, the value k = 6 is calculated based on the memory available to the video processes in the network nodes 101a-101n, which respectively host the image recognition engines 112a-112n. In the next step 1104, the images are set to an appropriate size. In one embodiment, the images have a width of 1280 pixels, a height of 702 pixels, and three RGB channels (representing red, green, and blue). At step 1106, a plurality of trained convolutional neural networks (CNNs) process the images and generate arrays of joint data structures per image. The output of the CNNs is the arrays of joint data structures per image (step 1108). This output is sent to the scene process at step 1110.
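
Step 1102, the grouping of k contiguously timestamped frames per camera, can be sketched as a simple generator. The frame representation (a dict with a `timestamp` key) and the choice to hold back an incomplete trailing batch are assumptions made for illustration; the text specifies only that k frames form a batch and that k = 6 in one embodiment.

```python
def batches_of_k(frames, k=6):
    """Yield consecutive batches of k frames, ordered by timestamp.

    frames: iterable of dicts that each carry a "timestamp" key.
    A trailing group with fewer than k frames is not yielded
    (assumption: it waits for more frames to arrive).
    """
    ordered = sorted(frames, key=lambda frame: frame["timestamp"])
    for start in range(0, len(ordered) - k + 1, k):
        yield ordered[start:start + k]


# Usage with 14 made-up frames from one camera, delivered out of order.
example_frames = [{"timestamp": t, "camera_id": "cam_01"} for t in reversed(range(14))]
example_batches = list(batches_of_k(example_frames))
```

With 14 frames and k = 6 this yields two full batches, and the last two frames are held back.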

FIG. 12A is a flowchart showing a first part of more detailed steps of the "scene process" step 908 in FIG. 9. The scene process combines the outputs from multiple video processes at step 1202. At step 1204, it is checked whether a joint data structure identifies a foot joint or a non-foot joint. If the joint data structure is of a foot joint, a homographic mapping is applied at step 1206 to combine the joint data structures corresponding to images from cameras with overlapping fields of view. This process identifies candidate foot joints (left and right foot joints). At step 1208, heuristics are applied to the candidate foot joints identified in step 1206 to identify sets of candidate foot joints as subjects. At step 1210, it is checked whether the set of candidate foot joints belongs to an existing subject. If not, a new subject is created at step 1212. Otherwise, the existing subject is updated at step 1214.

The flowchart in FIG. 12B shows a second part of more detailed steps of the "scene process" step 908. At step 1240, the data structures of non-foot joints are combined from multiple arrays of joint data structures corresponding to images in the sequences of images from cameras with overlapping fields of view. This is performed by mapping corresponding points from a first image from a first camera to a second image from a second camera with an overlapping field of view. Some details of this process are described above. At step 1242, heuristics are applied to the candidate non-foot joints. At step 1246, it is determined whether a candidate non-foot joint belongs to an existing subject. If so, the existing subject is updated at step 1248. Otherwise, the candidate non-foot joint is processed again at step 1250, after a predetermined time, to match it with an existing subject. At step 1252, it is checked whether the non-foot joint belongs to an existing subject. If so, the subject is updated at step 1256. Otherwise, the joint is discarded at step 1254.

In an example embodiment, the processes to identify new subjects, track subjects, and eliminate subjects (who have left the real space or were incorrectly created) are implemented as part of an "entity cohesion algorithm" performed by the runtime system (also referred to as the inference system). An entity is a constellation of joints, referred to as a subject above. The entity cohesion algorithm identifies entities in the real space and updates the locations of joints in the real space to track the movement of the entities.

FIG. 14 presents an illustration of video processes 1411 and a scene process 1415. In the illustrated embodiment, four video processes are shown, each processing images from one or more cameras 114. The video processes process the images as described above and identify joints per frame. In one embodiment, each video process identifies the 2D coordinates, a confidence number, a joint number, and a unique ID per joint per frame. The outputs 1452 of all video processes are given as input 1453 to the scene process 1415. In one embodiment, the scene process creates a joint key-value dictionary per moment in time, in which the key is the camera identifier and the value is the array of joints. The joints are re-projected into the perspectives of cameras with overlapping fields of view. The re-projected joints are stored as a key-value dictionary and can be used to produce foreground subject masks for each image in each camera, as discussed below. The key in this dictionary is a combination of joint id and camera id. The values in the dictionary are the 2D coordinates of the joint re-projected into the target camera's perspective.
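
The re-projection dictionary keyed by the combination of joint id and camera id can be sketched as below. The text does not state how the re-projection is computed, so this example assumes a standard pinhole model with one 3x4 projection matrix per camera; the function names and the example matrix are hypothetical.

```python
def project(point_3d, projection_matrix):
    """Project a 3D real-space point to 2D pixel coordinates.

    projection_matrix: 3x4 nested list (assumed pinhole camera model).
    """
    x3, y3, z3 = point_3d
    homogeneous = (x3, y3, z3, 1.0)
    x, y, w = (sum(row[i] * homogeneous[i] for i in range(4))
               for row in projection_matrix)
    return (x / w, y / w)


def reproject_joints(joints_3d, camera_matrices):
    """Build the dictionary {(joint_id, camera_id): (u, v)}.

    joints_3d: dict mapping joint_id -> (x, y, z) in real space.
    camera_matrices: dict mapping camera_id -> 3x4 projection matrix.
    """
    reprojected = {}
    for joint_id, location in joints_3d.items():
        for camera_id, matrix in camera_matrices.items():
            reprojected[(joint_id, camera_id)] = project(location, matrix)
    return reprojected


# Made-up example: one joint, one camera with a focal length of 100 pixels.
example_matrix = [[100.0, 0.0, 0.0, 0.0],
                  [0.0, 100.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0, 0.0]]
example_reprojection = reproject_joints({"j7": (1.0, 2.0, 4.0)},
                                        {"cam_02": example_matrix})
```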

The scene process 1415 produces an output 1457 comprising a list of all subjects in the real space at a moment in time. The list includes a key-value dictionary per subject. The key is the unique identifier of a subject and the value is another key-value dictionary, with the key being the frame number and the value being the camera-subject joint key-value dictionary. The camera-subject joint key-value dictionary is a per-subject dictionary in which the key is the camera identifier and the value is a list of joints.

Image Analysis to Identify and Track Inventory Items per Subject

A system and various implementations for tracking puts and takes of inventory items by subjects in an area of real space are described with reference to FIGS. 15A to 25. The system and processes are described with reference to FIG. 15A, an architectural level schematic of a system in accordance with an implementation. Because FIG. 15A is an architectural diagram, certain details are omitted to improve the clarity of the description.

Architecture of Multi-CNN Pipelines

FIG. 15A is a high-level architecture of pipelines of convolutional neural networks (also referred to as multi-CNN pipelines) that process image frames received from the cameras 114 to generate a shopping cart data structure for each subject in the real space. The system described here includes per-camera image recognition engines, as described above, for identifying and tracking multi-joint subjects. Alternative image recognition engines can be used, including examples in which only one "joint" is recognized and tracked per individual, or in which other features or other types of image data over space and time are utilized to recognize and track subjects in the real space being processed.

The multi-CNN pipelines run in parallel per camera, moving images from the respective cameras to the image recognition engines 112a-112n via per-camera circular buffers 1502. In one embodiment, the system comprises three subsystems: a first image processor subsystem 2602, a second image processor subsystem 2604, and a third image processor subsystem 2606. In one embodiment, the first image processor subsystem 2602 includes the image recognition engines 112a-112n, implemented as convolutional neural networks (CNNs) and referred to as joints CNNs 112a-112n. As described in relation to FIG. 1, the cameras 114 can be synchronized in time with each other, so that images are captured at the same time, or close in time, and at the same image capture rate. Images captured at the same time, or close in time, in all of the cameras covering an area of the real space are synchronized in the sense that they can be identified in the processing engines as representing different views, at a moment in time, of subjects having fixed positions in the real space.

In one embodiment, the cameras 114 are installed in a shopping store (such as a supermarket) such that sets of cameras (two or more) with overlapping fields of view are positioned over each aisle to capture images of the real space in the store. There are N cameras in the real space; for simplicity, however, only one camera is shown in FIG. 17A, as camera (i), where the value of i ranges from 1 to N. Each camera produces a sequence of images of the real space corresponding to its respective field of view.

In one embodiment, the image frames corresponding to the sequences of images from each camera are sent at a rate of 30 frames per second (fps) to the respective image recognition engines 112a-112n. Each image frame has a timestamp, an identity of the camera (abbreviated as "camera_id"), and a frame identity (abbreviated as "frame_id"), along with the image data. The image frames are stored in a circular buffer 1502 (also referred to as a ring buffer) per camera 114. The circular buffers 1502 store a set of consecutively timestamped image frames from the respective cameras 114.
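
A per-camera ring buffer of this kind can be sketched with Python's `collections.deque`, whose `maxlen` argument drops the oldest entry automatically when a new one is appended to a full buffer. The class names and the empty image payload are illustrative; the capacity of 110 frames is the example figure one embodiment gives elsewhere in this description.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class ImageFrame:
    """Fields follow the text: timestamp, camera_id, frame_id, image data."""
    timestamp: float
    camera_id: str
    frame_id: int
    image_data: bytes


class FrameRingBuffer:
    """Fixed-capacity buffer that keeps only the most recent frames."""

    def __init__(self, capacity):
        self._frames = deque(maxlen=capacity)

    def push(self, frame):
        self._frames.append(frame)  # evicts the oldest frame when full

    def window(self):
        """Return the buffered frames, oldest first."""
        return list(self._frames)


# Usage: push 150 frames at 30 fps into a 110-frame buffer.
ring = FrameRingBuffer(capacity=110)
for i in range(150):
    ring.push(ImageFrame(timestamp=i / 30.0, camera_id="cam_01",
                         frame_id=i, image_data=b""))
```

After the loop the buffer holds frames 40 through 149, which is the sliding window of the most recent 110 frames.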

The joints CNNs process the sequences of image frames per camera and identify 18 different types of joints of each subject present in their respective fields of view. The outputs of the joints CNNs 112a-112n corresponding to cameras with overlapping fields of view are combined to map the locations of joints from the 2D image coordinates of each camera to the 3D coordinates of the real space. The joint data structures 800 per subject (j), where j equals 1 to x, identify the locations of the joints of a subject (j) in the real space. The details of the subject data structure 800 are presented in FIG. 8. In one example embodiment, the joint data structure 800 is a two-level key-value dictionary of the joints of each subject. The first key is the frame_number and the value is a second key-value dictionary, with the key being the camera_id and the value being the list of joints assigned to the subject.

The data sets comprising the subjects identified by the joint data structures 800 and the corresponding image frames from the sequences of image frames per camera are given as input to a bounding box generator 1504 in the third image processor subsystem 2606. The third image processor subsystem further comprises foreground image recognition engines. In one embodiment, the foreground image recognition engines recognize semantically significant objects in the foreground (that is, shoppers, their hands, and inventory items) as they relate to puts and takes of inventory items, for example, over time in the images from each camera. In the example implementation shown in FIG. 15A, the foreground image recognition engines are implemented as WhatCNN 1506 and WhenCNN 1508. The bounding box generator 1504 implements logic to process the data sets to specify bounding boxes that include images of the hands of identified subjects in images in the sequences of images. The bounding box generator 1504 identifies the locations of hand joints in each source image frame per camera using the locations of hand joints in the multi-joint data structures 800 corresponding to the respective source image frame. In one embodiment, in which the coordinates of the joints in the subject data structure indicate the locations of joints in 3D real space coordinates, the bounding box generator maps the joint locations from 3D real space coordinates to 2D coordinates in the image frames of the respective source images.

The bounding box generator 1504 creates bounding boxes for hand joints in the image frames in the circular buffer per camera 114. In one embodiment, the bounding box is a 128 pixel (width) by 128 pixel (height) portion of the image frame with the hand joint located at the center of the bounding box. In other embodiments, the size of the bounding box is 64 pixels by 64 pixels or 32 pixels by 32 pixels. For m subjects in an image frame from a camera, there can be a maximum of 2m hand joints, and thus 2m bounding boxes. In practice, however, fewer than 2m hands are visible in an image frame because of occlusions by other subjects or other objects. In one example embodiment, the hand locations of subjects are inferred from the locations of the elbow and wrist joints. For example, the right hand location of a subject is extrapolated from the location of the right elbow (identified as p1) and the right wrist (identified as p2) as extrapolation_amount * (p2 - p1) + p2, where extrapolation_amount equals 0.4. In another embodiment, the joints CNNs 112a-112n are trained using left and right hand images. In such an embodiment, the joints CNNs 112a-112n therefore directly identify the locations of hand joints in the image frames per camera. The hand locations per image frame are used by the bounding box generator 1504 to create a bounding box per identified hand joint.
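
The hand extrapolation formula and the centered bounding box can be written out as a short sketch. The formula and the 0.4 factor come from the text; the function names, the rounding to whole pixels, and the clamping of the box to the frame boundary are assumptions added for illustration (the text does not say what happens when a hand is near the image edge).

```python
EXTRAPOLATION_AMOUNT = 0.4  # value given in the text


def extrapolate_hand(elbow, wrist, amount=EXTRAPOLATION_AMOUNT):
    """Hand location = amount * (p2 - p1) + p2, with p1 = elbow, p2 = wrist."""
    return tuple(amount * (w - e) + w for e, w in zip(elbow, wrist))


def hand_bounding_box(hand_xy, size=128, frame_width=1280, frame_height=702):
    """Return (left, top, right, bottom) of a size x size box centered on the hand.

    Assumption: the box is shifted to stay inside the frame when the hand
    is closer than size / 2 pixels to an edge.
    """
    half = size // 2
    center_x = int(round(hand_xy[0]))
    center_y = int(round(hand_xy[1]))
    left = min(max(center_x - half, 0), frame_width - size)
    top = min(max(center_y - half, 0), frame_height - size)
    return (left, top, left + size, top + size)


# Usage with made-up 2D pixel coordinates for the right elbow and wrist.
right_hand = extrapolate_hand(elbow=(100.0, 200.0), wrist=(150.0, 250.0))
right_hand_box = hand_bounding_box(right_hand)
```

Here the wrist lies 50 pixels from the elbow along each axis, so the hand is placed a further 0.4 * 50 = 20 pixels beyond the wrist, at about (170, 270).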

WhatCNN 1506 is a convolutional neural network trained to process the specified bounding boxes in the images to generate a classification of the hands of the identified subjects. One trained WhatCNN 1506 processes image frames from one camera. In the example embodiment of the shopping store, for each hand joint in each image frame, the WhatCNN 1506 identifies whether the hand joint is empty. The WhatCNN 1506 also identifies a SKU (stock keeping unit) number of the inventory item in the hand joint, a confidence value indicating that the item in the hand joint is a non-SKU item (that is, it does not belong to the shopping store inventory), and a context of the hand joint location in the image frame.

The outputs of the WhatCNN models 1506 for all cameras 114 are processed by a single WhenCNN model 1508 for a predetermined window of time. In the example of a shopping store, the WhenCNN 1508 performs time series analysis for both hands of subjects to identify whether a subject took a store inventory item from a shelf or put a store inventory item on a shelf. A shopping cart data structure 1510 (also referred to as a log data structure including a list of inventory items) is created per subject to keep a record of the store inventory items in a shopping cart (or basket) associated with the subject.

The second image processor subsystem 2604 receives the same data sets that are given as input to the third image processor, comprising the subjects identified by the joint data structures 800 and the corresponding image frames from the sequences of image frames per camera. The subsystem 2604 includes background image recognition engines, which recognize semantically significant differences in the background (that is, inventory display structures such as shelves) as they relate to puts and takes of inventory items, for example, over time in the images from each camera. A selection logic component (not shown in FIG. 15A) uses a confidence score to select the output from either the second image processor or the third image processor to generate the shopping cart data structure 1510.
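
The selection logic and the resulting cart update can be illustrated as follows. This is a sketch under several assumptions: the text says only that a confidence score is used to choose between the two image processors, so the detection format (`sku`, `action`, `confidence`), the higher-score-wins rule, the tie-breaking toward the third image processor, and the cart as a SKU-to-quantity dictionary are all hypothetical.

```python
def select_detection(second_output, third_output):
    """Return whichever subsystem output has the higher confidence score."""
    # Assumption: on a tie, the third image processor's output is used.
    if second_output["confidence"] > third_output["confidence"]:
        return second_output
    return third_output


def update_cart(cart, detection):
    """Apply a take or put to a per-subject cart (SKU -> quantity)."""
    sku = detection["sku"]
    if detection["action"] == "take":
        cart[sku] = cart.get(sku, 0) + 1
    elif detection["action"] == "put":
        remaining = cart.get(sku, 0) - 1
        if remaining > 0:
            cart[sku] = remaining
        else:
            cart.pop(sku, None)  # item returned to the shelf
    return cart
```

A caller would run `select_detection` once per detected event and pass the result to `update_cart` for the subject concerned.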

FIG. 15B shows a coordination logic module 1522 that combines the results of multiple WhatCNN models and gives them as input to a single WhenCNN model. As mentioned above, two or more cameras with overlapping fields of view capture images of subjects in the real space. The joints of a single subject can appear in the image frames of multiple cameras in the respective image channels 1520. A separate WhatCNN model identifies the SKUs of inventory items in the hands (represented by hand joints) of subjects. The coordination logic module 1522 combines the outputs of the WhatCNN models into a single consolidated input for the WhenCNN model. The WhenCNN model 1508 operates on the consolidated input to generate the shopping cart of the subject.

Detailed implementations of the system comprising the multi-CNN pipelines of FIG. 15A are presented in FIGS. 16, 17, and 18. In the example of the shopping store, the system tracks puts and takes of inventory items by subjects in an area of real space. The area of real space is the shopping store, with inventory items placed on shelves organized in aisles as shown in FIGS. 2 and 3. It is understood that shelves containing inventory items can be organized in a variety of different arrangements. For example, shelves can be arranged in a line with their back sides against a wall of the shopping store and their front sides facing an open area in the real space. A plurality of cameras 114 with overlapping fields of view in the real space produce sequences of images of their corresponding fields of view. The field of view of one camera overlaps with the field of view of at least one other camera, as shown in FIGS. 2 and 3.

Joints CNN - Identification and Update of Subjects

FIG. 16 is a flowchart of the processing steps performed by the joints CNNs 112a-112n to identify subjects in the real space. In the example of a shopping store, the subjects are customers in the store moving in the aisles between shelves and in other open spaces. The process starts at step 1602. Note that, as described above, the cameras are calibrated before the sequences of images from the cameras are processed to identify subjects. Details of camera calibration are presented above. The cameras 114 with overlapping fields of view capture images of the real space in which subjects are present (step 1604). In one embodiment, the cameras are configured to generate synchronized sequences of images. The sequences of images from each camera are stored in respective circular buffers 1502 per camera. A circular buffer (also referred to as a ring buffer) stores the sequences of images in a sliding window of time. In one embodiment, a circular buffer stores 110 image frames from the corresponding camera. In another embodiment, each circular buffer 1502 stores image frames for a time period of 3.5 seconds. It is understood that, in other embodiments, the number of image frames (or the time period) can be greater or less than the example values listed above.

The joints CNNs 112a-112n receive the sequences of image frames from the corresponding cameras 114 (step 1606). Each joints CNN processes batches of images from the corresponding camera through multiple convolutional network layers to identify the joints of subjects in the image frames from that camera. The architecture and processing of images by an example convolutional neural network are presented in FIG. 5. Because the cameras 114 have overlapping fields of view, the joints of a subject are identified by more than one joints CNN. The two-dimensional (2D) coordinates of the joint data structures 600 produced by the joints CNNs are mapped to the three-dimensional (3D) coordinates of the real space to identify the joint locations in the real space. Details of this mapping are presented in the discussion of FIG. 7, in which the tracking engine 110 translates the coordinates of the elements in the arrays of joint data structures corresponding to images in the different sequences of images into candidate joints having coordinates in the real space.

The joints of a subject are organized into two categories (foot joints and non-foot joints) for grouping the joints into constellations, as discussed above. The left and right ankle joint types in the current example are considered foot joints for the purposes of this procedure. At step 1608, heuristics are applied to assign a candidate left foot joint and a candidate right foot joint to a set of candidate joints to create a subject. Following this, at step 1610, it is determined whether the newly identified subject already exists in the real space. If not, a new subject is created at step 1614; otherwise, the existing subject is updated at step 1612.

Other joints from the galaxy of candidate joints can be linked to the subject to build a cluster of some or all of the joint types for the created subject. At step 1616, heuristics are applied to the non-foot joints to assign them to the identified subjects. The global metric calculator 702 calculates a global metric value and attempts to minimize it by checking different combinations of non-foot joints. In one embodiment, the global metric is a sum of heuristics organized in four categories, as described above.

The logic to identify sets of candidate joints comprises heuristic functions, based on physical relationships among the joints of subjects in the real space, that identify sets of candidate joints as subjects. At step 1618, the existing subjects are updated using the corresponding non-foot joints. If there are more images to process (step 1620), steps 1606 to 1618 are repeated; otherwise, the process ends at step 1622. A first data set is produced at the end of the process described above. The first data set identifies subjects and the locations of the identified subjects in the real space. In one embodiment, the first data set is presented above in relation to FIG. 15A as the joints data structures 800 per subject.

WhatCNN - Classification of Hand Joints

FIG. 17 is a flowchart illustrating processing steps to identify inventory items in the hands of subjects identified in the real space. In the example of a shopping store, the subjects are customers in the store. As customers move through the aisles and open spaces, they pick up inventory items stocked on the shelves and put them in their shopping carts or baskets. The image recognition engines identify subjects in the sets of images in the sequences of images received from the plurality of cameras. The system includes logic to process the sets of images in the sequences of images that include the identified subjects, in order to detect takes of inventory items by identified subjects and puts of inventory items on the shelves by identified subjects.

In one embodiment, the logic to process the sets of images includes, for the identified subjects, logic to process images to generate classifications of the images of the identified subjects. The classifications include whether the identified subject is holding an inventory item. The classifications include a first proximity classification indicating the location of a hand of the identified subject relative to a shelf. The classifications include a second proximity classification indicating the location of a hand of the identified subject relative to the body of the identified subject. The classifications further include a third proximity classification indicating the location of a hand of the identified subject relative to a basket associated with the identified subject. Finally, the classifications include an identifier of a likely inventory item.

In another embodiment, the logic to process the sets of images includes, for the identified subjects, logic to identify bounding boxes of data representing hands in images in the sets of images of the identified subjects. The data in the bounding boxes is processed to generate classifications of the data within the bounding boxes for the identified subjects. In such an embodiment, the classifications identify whether the identified subject is holding an inventory item. The classifications include a first proximity classification indicating the location of a hand of the identified subject relative to a shelf. The classifications include a second proximity classification indicating the location of a hand of the identified subject relative to the body of the identified subject. The classifications include a third proximity classification indicating the location of a hand of the identified subject relative to a basket associated with the identified subject. Finally, the classifications include an identifier of a likely inventory item.

The process starts at step 1702. At step 1704, the locations of the hands (represented by hand joints) of subjects in the image frames are identified. The bounding box generator 1504 identifies the hand locations of subjects per frame from each camera, using the joint locations identified in the first data set generated by the joints CNNs 112a-112n as described in FIG. 18. Following this, at step 1706, the bounding box generator 1504 processes the first data set to specify bounding boxes that include images of the hands of identified multi-joint subjects in images in the sequences of images. Details of the bounding box generator are presented above in the discussion of FIG. 15A.

A second image recognition engine receives the sequences of images from the plurality of cameras and processes the specified bounding boxes in the images to generate classifications of the hands of the identified subjects (step 1708). In one embodiment, each of the image recognition engines used to classify the subjects based on images of hands comprises a trained convolutional neural network referred to as WhatCNN 1506. The WhatCNNs are arranged in the multi-CNN pipeline, as described above in relation to FIG. 15A. In one embodiment, the input to a WhatCNN is a multi-dimensional array BxWxHxC (also referred to as a BxWxHxC tensor). "B" is the batch size, indicating the number of image frames in a batch of images processed by the WhatCNN. "W" and "H" indicate the width and height of the bounding boxes in pixels, and "C" is the number of channels. In one embodiment, there are 30 images in a batch (B=30), and the size of the bounding boxes is 32 pixels (width) by 32 pixels (height). There can be six channels, representing red, green, blue, foreground mask, forearm mask, and upper arm mask, respectively. The foreground mask, forearm mask, and upper arm mask are additional and optional input data sources for the WhatCNN in this example, which the CNN can include in its processing to classify information in the RGB image data. The foreground mask can be generated using, for example, a mixture of Gaussians algorithm. The forearm mask can be a line between the wrist and the elbow, providing context produced using information in the joints data structure. Likewise, the upper arm mask can be a line between the elbow and the shoulder, produced using information in the joints data structure. Different values of the B, W, H, and C parameters can be used in other embodiments. For example, in another embodiment, the bounding boxes are larger, for example 64 pixels (width) by 64 pixels (height) or 128 pixels (width) by 128 pixels (height).
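The layout of one BxWxHxC input can be illustrated with a minimal sketch. It assumes the forearm and upper arm masks are drawn as one-pixel-wide lines, and every joint coordinate and helper name (`line_mask`, `wrist`, `elbow`, `shoulder`) is invented for illustration; the RGB and foreground planes are zero-filled placeholders rather than real crops.

```python
def line_mask(width, height, start, end):
    """Rasterise a straight segment from start to end into a 0/1 mask.

    start and end are (x, y) pixel positions inside the bounding box.
    """
    mask = [[0] * width for _ in range(height)]
    (x0, y0), (x1, y1) = start, end
    steps = max(abs(x1 - x0), abs(y1 - y0), 1)
    for i in range(steps + 1):
        t = i / steps
        x = round(x0 + t * (x1 - x0))
        y = round(y0 + t * (y1 - y0))
        if 0 <= x < width and 0 <= y < height:
            mask[y][x] = 1
    return mask


W = H = 32
# Hypothetical joint positions, already expressed in bounding-box pixels.
wrist, elbow, shoulder = (16, 16), (4, 28), (0, 10)
forearm_mask = line_mask(W, H, wrist, elbow)
upper_arm_mask = line_mask(W, H, elbow, shoulder)

# Placeholder planes; a real pipeline would crop these from the camera frame
# and from a background-subtraction result.
red = green = blue = foreground = [[0] * W for _ in range(H)]

channels = [red, green, blue, foreground, forearm_mask, upper_arm_mask]
# One image laid out as W x H x C, indexed image[x][y][c].
image = [[[plane[y][x] for plane in channels] for y in range(H)] for x in range(W)]
batch = [image] * 30  # B = 30; the same image is repeated for brevity
shape = (len(batch), len(batch[0]), len(batch[0][0]), len(batch[0][0][0]))
print(shape)
```

The two arm masks carry no colour information; they only mark where the arm lies inside the crop, which is one plausible way to give the network the context described above.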

Each WhatCNN 1506 processes batches of images to generate classifications of the hands of the identified subjects. The classifications include whether the identified subject is holding an inventory item. The classifications include one or more classifications indicating the locations of the hands relative to the shelf and relative to the subject, which are usable to detect puts and takes. In this example, a first proximity classification indicates the location of a hand of the identified subject relative to a shelf. The classifications in this example include a second proximity classification indicating the location of a hand of the identified subject relative to the body of the identified subject, where a subject may hold an inventory item during shopping. The classifications in this example further include a third proximity classification indicating the location of a hand of the identified subject relative to a basket associated with the identified subject, where a "basket" in this context is a bag, a basket, a cart, or another object used by the subject to hold inventory items during shopping. Finally, the classifications include an identifier of a likely inventory item. The final layer of WhatCNN 1506 produces logits, which are raw prediction values. The logits are represented as floating point values and are further processed, as described below, to generate a classification result. In one embodiment, the output of the WhatCNN model is a multi-dimensional array BxL (also referred to as a BxL tensor). "B" is the batch size, and "L = N+5" is the number of logits output per image frame. "N" is the number of SKUs, representing the "N" unique inventory items for sale in the shopping store.

The output "L" per image frame is the raw activations from WhatCNN 1506. The logits "L" are processed at step 1710 to identify the inventory item and the context. The first "N" logits represent the confidence that the subject is holding one of the "N" inventory items. The logits "L" include five (5) additional logits, which are explained below. The first of these represents the confidence that the image of the item in the subject's hand is not one of the store SKU items (also referred to as a non-SKU item). The second indicates the confidence as to whether the subject is holding an item. A large positive value indicates that the WhatCNN model has a high level of confidence that the subject is holding an item. A large negative value indicates that the model is confident that the subject is not holding any item. A value close to zero indicates that the WhatCNN model is not confident in predicting whether the subject is holding an item.

The next three logits represent the first, second, and third proximity classifications: the first indicates the location of a hand of the identified subject relative to a shelf, the second indicates the location of the hand relative to the body of the identified subject, and the third indicates the location of the hand relative to a basket associated with the identified subject. These three logits thus represent the context of the hand location, with one logit each indicating the confidence that the hand is near a shelf, near a basket (or shopping cart), or near the subject's body. In one embodiment, the WhatCNN is trained using a training data set containing hand images in the three contexts: near a shelf, near a basket (or shopping cart), and near the subject's body. In another embodiment, a "proximity" parameter is used by the system to classify the context of the hand. In such an embodiment, the system determines the distance of the identified subject's hand from the shelf, the basket (or shopping cart), and the subject's body in order to classify the context.

The output of the WhatCNN is the "L" logits, comprising N SKU logits, 1 non-SKU logit, 1 holding logit, and 3 context logits, as described above. The SKU logits (the first N logits) and the non-SKU logit (the first logit following the N logits) are processed by a softmax function. As described above with reference to FIG. 5, the softmax function transforms a K-dimensional vector of arbitrary real values into a K-dimensional vector of real values in the range [0, 1] that add up to 1. Here the softmax function calculates a probability distribution over the N + 1 items. The output values are between 0 and 1, and the sum of all the probabilities equals one. The softmax function (for multi-class classification) returns the probability of each class. The class with the highest probability is the predicted class (also referred to as the target class).
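The grouping of the L = N + 5 logits can be sketched as follows. The softmax over the first N + 1 logits follows the paragraph above; the sigmoid on the holding logit and the second softmax over the three context logits follow the description of those logits in this section. The function names, the dictionary keys, and the example values are assumptions made for illustration.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(value):
    """Logistic function written to avoid overflow for large |value|."""
    if value >= 0:
        return 1.0 / (1.0 + math.exp(-value))
    e = math.exp(value)
    return e / (1.0 + e)


def interpret_logits(logits, num_skus):
    """Split one frame's L = N + 5 logits and normalise each group."""
    if len(logits) != num_skus + 5:
        raise ValueError("expected N + 5 logits")
    item_probs = softmax(logits[: num_skus + 1])     # N SKUs plus non-SKU
    holding_prob = sigmoid(logits[num_skus + 1])     # holding logit
    context_probs = softmax(logits[num_skus + 2:])   # three context logits
    best = max(range(num_skus + 1), key=item_probs.__getitem__)
    return {
        "item_index": best,  # index N would mean "non-SKU item"
        "item_probs": item_probs,
        "holding_prob": holding_prob,
        "context_probs": context_probs,
    }


# N = 3 SKUs, so L = 8 logits; the values are made up.
example = [0.2, 2.5, -1.0, 0.1, 3.0, 1.5, -0.5, 0.0]
result = interpret_logits(example, num_skus=3)
print(result["item_index"], round(result["holding_prob"], 3))
```

Keeping the three groups separate matters because each answers a different question: which item, whether anything is held at all, and where the hand is.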

The holding logit is processed by a sigmoid function. The sigmoid function takes a real value as input and produces an output value in the range 0 to 1. The output of the sigmoid function identifies whether the hand is empty or holding an item. The three context logits are processed by a softmax function to identify the context of the hand joint location. At step 1712, it is checked whether there are more images to process. If so, steps 1704-1710 are repeated; otherwise, the process ends at step 1714.

WhenCNN - Time Series Analysis to Identify Puts and Takes of Items

In one embodiment, the system implements logic to perform time series analysis over the classifications of subjects, in order to detect takes and puts by the identified subjects based on foreground image processing of the subjects. The time series analysis identifies gestures of the subjects and the inventory items associated with the gestures represented in the sequences of images.

The outputs of the WhatCNNs 1506 in the multi-CNN pipeline are provided as input to WhenCNN 1508, which processes these inputs to detect takes and puts by the identified subjects. Finally, the system includes logic, responsive to the detected takes and puts, to generate a log data structure that includes a list of inventory items for each identified subject. In the example of a shopping store, the log data structure is also referred to as the shopping cart data structure 1510 per subject.

FIG. 18 presents a process implementing logic to generate a shopping cart data structure per subject. The process starts at step 1802. The input to WhenCNN 1508 is prepared at step 1804. The input to the WhenCNN is a multi-dimensional array BxCxTxCams, where B is the batch size, C is the number of channels, T is the number of frames considered for a window of time, and Cams is the number of cameras 114. In one embodiment, the batch size "B" is 64 and the value of "T" is 110 image frames, or the number of image frames in 3.5 seconds of time.

For each subject identified per image frame, per camera, a list of 10 logits per hand joint (20 logits for both hands) is produced. The holding and context logits are part of the "L" logits generated by WhatCNN 1506, as described above.

The above data structure is generated for each hand in an image frame and also includes data about the other hand of the same subject. For example, if the data is for the left hand joint of a subject, the corresponding values for the right hand are included as "other" logits. The fifth logit (item number 3 in the list above, referred to as log_sku) is the log of the SKU logit in the "L" logits described above. The sixth logit is the log of the SKU logit for the other hand. The "roll" function generates the same information before and after the current frame. For example, the seventh logit (referred to as roll(log_sku, -30)) is the log of the SKU logit 30 frames earlier than the current frame. The eighth logit is the log of the SKU logit for the hand 30 frames later than the current frame. The ninth and tenth data values in the list are the similar data for the other hand, 30 frames earlier and 30 frames later than the current frame. A similar data structure is also generated for the other hand, resulting in a total of 20 logits per subject per image frame per camera. Therefore, the number of channels in the input to the WhenCNN is 20 (i.e., C=20 in the multi-dimensional array BxCxTxCams).

For all image frames in the batch of image frames (e.g., B=64) from each camera, similar data structures of 20 hand logits per subject identified in the image frame are generated. A window of time (T=3.5 seconds or 110 image frames) is used to search forward and backward in the sequence of image frames for the hand joints of subjects. At step 1806, the 20 hand logits per subject per frame are consolidated from the multi-CNN pipeline. In one embodiment, the batch of image frames (64) can be imagined as a smaller window of image frames placed in the middle of the larger window of 110 image frames, with additional image frames on both sides for the forward and backward search. The input BxCxTxCams to WhenCNN 1508 is composed of the 20 logits for both hands of the subjects identified in the batch "B" of image frames from all cameras 114 (referred to as "Cams"). The consolidated input is provided to a single trained convolutional neural network, referred to as the WhenCNN model 1508.

The output of the WhenCNN model comprises 3 logits, representing the confidence in three possible actions of an identified subject: taking an inventory item from a shelf, putting an inventory item back on the shelf, and no action. The three output logits are processed by a softmax function to predict the action performed. The three classification logits are generated at regular intervals for each subject, and the results are stored per person along with a timestamp. In one embodiment, the three logits are generated every twenty frames per subject. In such an embodiment, at an interval of every twenty image frames per subject, a window of 110 image frames is formed around the current image frame.
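The action prediction and the windowing described above can be sketched as follows. This is an illustration only: the order of the three actions in `ACTIONS`, the helper names, and the choice to centre the 110-frame window exactly on the current frame are assumptions.

```python
import math

ACTIONS = ("take", "put", "no_action")  # assumed ordering of the 3 logits


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def predict_action(logits):
    """Return the most probable action and its probability."""
    probs = softmax(logits)
    best = max(range(len(ACTIONS)), key=probs.__getitem__)
    return ACTIONS[best], probs[best]


def window_around(frame, size=110):
    """Half-open [start, end) frame range of a window centred on frame."""
    start = frame - size // 2
    return start, start + size


def evaluation_frames(first, last, stride=20):
    """Frames at which the three logits are produced for one subject."""
    return list(range(first, last + 1, stride))


action, confidence = predict_action([2.0, 0.1, -1.0])
windows = [window_around(f) for f in evaluation_frames(200, 240)]
print(action, windows)
```

Because consecutive windows overlap heavily (110 frames wide, 20 frames apart), the same physical gesture will usually be reported more than once, which is why a de-duplication step is needed afterwards.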

A time series analysis of these three logits per subject over a period of time is performed (step 1808) to identify the gestures corresponding to true events and the times at which they occur. A non-maximum suppression (NMS) algorithm is used for this purpose. Because one event (i.e., a put or take of an item by a subject) is detected by WhenCNN 1508 multiple times (both from the same camera and from multiple cameras), the NMS removes the superfluous events for a subject. NMS is a rescoring technique comprising two main tasks: a "matching loss" that penalizes superfluous detections, and "joint processing" of neighbors to determine whether there is a better detection close by.
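As a rough illustration of removing duplicate detections of one event, the sketch below uses a simple greedy temporal suppression. It is a stand-in, not the rescoring formulation with a matching loss described above; the function name, the `min_gap` threshold, and the sample detections are all assumptions.

```python
def suppress_duplicates(detections, min_gap=30):
    """Greedy temporal non-maximum suppression.

    detections: (frame_number, confidence) pairs for one subject and one
    action type, possibly pooled from several cameras.
    The strongest detection is kept first; any other detection lying
    within min_gap frames of a kept one is dropped.
    """
    kept = []
    ranked = sorted(detections, key=lambda d: d[1], reverse=True)
    for frame, score in ranked:
        if all(abs(frame - kept_frame) > min_gap for kept_frame, _ in kept):
            kept.append((frame, score))
    return sorted(kept)


# Two real events, each reported several times around frames 100 and 310.
raw = [(100, 0.90), (105, 0.70), (98, 0.60), (300, 0.80), (310, 0.85)]
events = suppress_duplicates(raw)
print(events)
```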

The true events of takes and puts for each subject are further processed by calculating the average of the SKU logits for the 30 image frames prior to the image frame with the true event. Finally, the arguments of the maxima (abbreviated arg max or argmax) are used to determine the largest value. The inventory item classified by the argmax value is used to identify the inventory item put on or taken from the shelf. At step 1810, the inventory item is added to the log of SKUs (also referred to as the shopping cart or basket) of the respective subject. Process steps 1804 to 1810 are repeated if there is more classification data (checked at step 1812). Over a period of time, this processing results in updates to the shopping cart or basket of each subject. The process ends at step 1814.

WhatCNN with Scene and Video Processes

FIG. 19 presents an embodiment of the system in which data from the scene process 1415 and the video processes 1411 is given as input to the WhatCNN model 1506 to generate hand image classifications. Note that the output of each video process is given to a separate WhatCNN model. The output from the scene process 1415 is a joints dictionary. In this dictionary, the keys are unique joint identifiers and the values are the unique subject identifiers with which the joints are associated. If no subject is associated with a joint, the joint is not included in the dictionary. Each video process 1411 receives the joints dictionary from the scene process and stores it in a ring buffer that maps frame numbers to the returned dictionaries. Using the returned key-value dictionary, the video process selects, at each moment in time, the subsets of the image that are near the hands associated with identified subjects. These portions of the image frames around hand joints can be referred to as region proposals.
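The ring buffer of joints dictionaries can be pictured with a small sketch. The class name, the capacity, and the identifier strings are hypothetical; the text above only states that frame numbers map to the returned dictionaries.

```python
from collections import OrderedDict


class FrameRingBuffer:
    """Keeps the joints dictionaries for the most recent frames only."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._frames = OrderedDict()

    def put(self, frame_number, joint_to_subject):
        """Store the dictionary for a frame, evicting the oldest entry."""
        self._frames[frame_number] = joint_to_subject
        self._frames.move_to_end(frame_number)
        while len(self._frames) > self.capacity:
            self._frames.popitem(last=False)

    def get(self, frame_number):
        """Return the dictionary for a frame, or None if it was evicted."""
        return self._frames.get(frame_number)


buffer = FrameRingBuffer(capacity=3)
buffer.put(10, {"joint-a": "subject-1", "joint-b": "subject-1"})
buffer.put(11, {"joint-c": "subject-2"})
buffer.put(12, {})  # no joint was associated with a subject in this frame
buffer.put(13, {"joint-d": "subject-1"})
print(buffer.get(10), buffer.get(13))
```

A bounded buffer like this lets a video process look up which subject a hand joint belongs to for recent frames without holding the whole history in memory.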

In the example of a shopping store, a region proposal is the image frame of a hand location from one or more cameras that have the subject in their corresponding fields of view. Region proposals are generated by every camera in the system. They include empty hands as well as hands carrying shopping store inventory items and items that do not belong to the shopping store inventory. The video processes select the portions of the image frames containing hand joints at each moment in time. Similar slices of the foreground masks are generated. The above (the image portions of hand joints and the foreground masks) are concatenated with the joints dictionary (indicating the subject to which each hand joint belongs) to produce a multi-dimensional array. This output from the video processes is given as input to the WhatCNN model.

The classification results of the WhatCNN model are stored in the region proposal data structures (produced by the video processes). All regions for a moment in time are then given back as input to the scene process. The scene process stores the results in a key-value dictionary in which the key is a subject identifier and the value is itself a key-value dictionary, in which the key is a camera identifier and the value is the logits of a region. This aggregated data structure is then stored in a ring buffer that maps frame numbers to the aggregated structure for each moment in time.

WhenCNN with Scene and Video Processes

FIG. 20 presents an embodiment of the system in which WhenCNN 1508 receives output from the scene process, following the hand image classifications performed by the WhatCNN models per video process as explained in FIG. 19. Region proposal data structures for a period of time (e.g., one second) are given as input to the scene process. In one embodiment, in which cameras capture images at a rate of 30 frames per second, the input includes 30 time periods and the corresponding region proposals. The scene process reduces the 30 region proposals (per hand) to a single integer representing the inventory item SKU. The output of the scene process is a key-value dictionary in which the key is a subject identifier and the value is the SKU integer.

The WhenCNN model 1508 performs a time series analysis to determine the evolution of this dictionary over time. This results in the identification of items taken from shelves and items put on shelves in the shopping store. The output of the WhenCNN model is a key-value dictionary in which the key is a subject identifier and the value is the logits produced by the WhenCNN. In one embodiment, a set of heuristics 2002 is used to determine the shopping cart data structure 1510 per subject. The heuristics are applied to the output of the WhenCNN, the joint locations of subjects indicated by their respective joints data structures, and planograms. A planogram is a precomputed map of the inventory items on shelves. The heuristics 2002 determine, for each take or put, whether the inventory item is put on a shelf or taken from a shelf, whether the inventory item is put in a shopping cart (or basket) or taken from the shopping cart (or basket), or whether the inventory item is close to the identified subject's body.

Example Architecture of the What-CNN Model

FIG. 21 presents an example architecture of the WhatCNN model 1506. In this example architecture, there are a total of 26 convolutional layers. The dimensions of the different layers, in terms of their respective width (in pixels), height (in pixels), and number of channels, are also presented. The first convolutional layer 2113 receives the input 2111 and has a width of 64 pixels, a height of 64 pixels, and 64 channels (written as 64x64x64). The details of the input to the WhatCNN are presented above. The direction of the arrows indicates the flow of data from one layer to the following layer. The second convolutional layer 2115 has dimensions of 32x32x64. Following the second layer, there are eight convolutional layers (shown in box 2117), each with dimensions of 32x32x64. Only two layers, 2119 and 2121, are shown in box 2117 for illustration purposes. This is followed by another eight convolutional layers 2123 with dimensions of 16x16x128. Two such convolutional layers, 2125 and 2127, are shown in FIG. 21. Finally, there are the last eight convolutional layers 2129, each with dimensions of 8x8x256. Two convolutional layers, 2131 and 2133, are shown in box 2129 for illustration.
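The layer count can be checked with a short tally. The sketch below only restates the dimensions given in the paragraph above; the tuple layout and the helper name `activations_per_layer` are illustrative choices.

```python
# (description, width, height, channels, number of layers), read from the text
LAYER_GROUPS = [
    ("first convolutional layer 2113", 64, 64, 64, 1),
    ("second convolutional layer 2115", 32, 32, 64, 1),
    ("eight layers in box 2117", 32, 32, 64, 8),
    ("eight layers 2123", 16, 16, 128, 8),
    ("eight layers in box 2129", 8, 8, 256, 8),
]


def activations_per_layer(width, height, channels):
    """Number of output values one layer of the given shape produces."""
    return width * height * channels


total_layers = sum(group[4] for group in LAYER_GROUPS)
for name, width, height, channels, count in LAYER_GROUPS:
    size = activations_per_layer(width, height, channels)
    print(f"{name}: {count} x {width}x{height}x{channels} = {size} values each")
print("total convolutional layers:", total_layers)
```

The tally gives 1 + 1 + 8 + 8 + 8 = 26, matching the stated total; each time the spatial size halves, the channel count either stays at 64 or doubles.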

There is one fully connected layer 2135, with 256 inputs from the last convolutional layer 2133, producing N+5 outputs. As described above, "N" is the number of SKUs, representing the "N" unique inventory items for sale in the shopping store. The five additional logits include a first logit representing the confidence that the item in the image is a non-SKU item and a second logit representing the confidence as to whether the subject is holding an item. The next three logits represent the first, second, and third proximity classifications, as described above. The final output of the WhatCNN is shown at 2137. The example architecture uses batch normalization (BN). The distribution of each layer in a convolutional neural network (CNN) changes during training and varies from one layer to another. This reduces the convergence speed of the optimization algorithm. Batch normalization (Ioffe and Szegedy 2015) is a technique to overcome this problem. ReLU (rectified linear unit) activation is used for the non-linearity of each layer, except for the final output, where softmax is used.

FIGS. 22, 23, and 24 are graphical visualizations of different parts of an implementation of WhatCNN 1506. The figures are adapted from graphical visualizations of the WhatCNN model generated by TensorBoard™. TensorBoard™ is a suite of visualization tools for inspecting and understanding deep learning models such as convolutional neural networks.

FIG. 22 shows the high-level architecture of the convolutional neural network model that detects a single hand (the "single hand" model 2210). The WhatCNN model 1506 comprises two such convolutional neural networks, one for detecting the left hand and one for detecting the right hand. In the illustrated embodiment, the architecture includes four blocks, referred to as block 0 2216, block 1 2218, block 2 2220, and block 3 2222. A block is a higher-level abstraction and comprises multiple nodes representing convolutional layers. The blocks are arranged in a sequence from lower to higher such that the output from one block is the input to the subsequent block. The architecture also includes a pooling layer 2214 and a convolutional layer 2212. Different non-linearities can be used between the blocks. In the illustrated embodiment, a ReLU non-linearity is used, as described above.

In the illustrated embodiment, the input to the single hand model 2210 is a BxWxHxC tensor, defined above in the description of the WhatCNN 1506. "B" is the batch size, "W" and "H" indicate the width and height of the input image, and "C" is the number of channels. The output of the single hand model 2210 is combined with the output of the second single hand model and passed to a fully connected network.

During training, the output of the single hand model 2210 is compared with the ground truth. The prediction error calculated between the output and the ground truth is used to update the weights of the convolutional layers. In the illustrated embodiment, stochastic gradient descent (SGD) is used for training the WhatCNN 1506.

FIG. 23 presents further details of block 0 2216 of the single hand convolutional neural network model of FIG. 22. It comprises four convolutional layers, labeled conv0 (in box 2310), conv1 2318, conv2 2320, and conv3 2322. Further details of the convolutional layer conv0 are presented in box 2310. The input is processed by a convolutional layer 2312. The output of the convolutional layer is processed by a batch normalization layer 2314. A ReLU non-linearity 2316 is applied to the output of the batch normalization layer 2314. The output of the convolutional layer conv0 is passed to the next layer conv1 2318. The output of the final convolutional layer conv3 is processed through an addition operation 2324. This operation sums the output from the layer conv3 2322 with the unmodified input, which arrives through a skip connection 2326. It has been shown by He et al. in their paper titled "Identity Mappings in Deep Residual Networks" (published at https://arxiv.org/pdf/1603.05027.pdf on July 25, 2016) that forward and backward signals can be directly propagated from one block to any other block. The signal propagates unchanged through the convolutional neural network. This technique improves the training and test performance of deep convolutional neural networks.
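The addition operation and skip connection of FIG. 23 amount to computing "output of the layers plus the unmodified input". The toy sketch below illustrates only that pattern: the layers are stand-in functions on flat lists rather than convolution and batch normalization, and `residual_block` and `relu` are names introduced here for illustration.

```python
def relu(values):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]


def residual_block(x, layers):
    """Apply `layers` in sequence, then add the unmodified input.

    `layers` is a list of callables standing in for conv0..conv3; the
    final element-wise sum plays the role of the addition operation fed
    by the skip connection.
    """
    out = x
    for layer in layers:
        out = layer(out)
    if len(out) != len(x):
        raise ValueError("skip connection requires matching shapes")
    return [a + b for a, b in zip(out, x)]


if __name__ == "__main__":
    double = lambda values: [2.0 * v for v in values]
    print(residual_block([1.0, -2.0, 0.5], [double, relu]))
```

Because the input is added back unchanged, the block only has to learn a correction to the identity mapping, which is the property the cited paper relies on.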

As described with reference to FIG. 21, the output of the convolutional layers of the WhatCNN is processed by a fully connected layer. The outputs of the two single hand models 2210 are combined and passed as input to the fully connected layer. FIG. 24 is an example implementation of a fully connected (FC) layer 2410. The input to the FC layer is processed by a reshape operator 2412. The reshape operator changes the shape of the tensor before passing it to the next layer 2420. Reshaping includes flattening the output from the convolutional layers, that is, reshaping the output from a multi-dimensional matrix to a one-dimensional matrix or vector. The output of the reshape operator 2412 is passed to a matrix multiplication operator labeled MatMul 2422. The output from MatMul 2422 is passed to a matrix-multiply-and-add operator labeled xw_plus_b 2424. For each input "x", the operator 2424 multiplies the input by a matrix "w" and adds a vector "b" to produce the output. "w" is a trainable parameter associated with the input "x", and "b" is another trainable parameter, called the bias or intercept. The output 2426 from the fully connected layer 2410 is a BxL tensor, as explained above in the description of the WhatCNN 1506. "B" is the batch size, and "L=N+5" is the number of logits output per image frame. "N" is the number of SKUs, representing "N" unique inventory items for sale in the shopping store.

Training of WhatCNN Model

A training data set of images of hands holding different inventory items in different contexts, as well as empty hands in different contexts, is created. To achieve this, human actors hold each unique SKU inventory item in multiple different ways, at different locations of a test environment. The contexts of their hands range from being close to the actor's body, to being close to the store's shelf, to being close to the actor's shopping cart or basket. The actors also perform the above actions with an empty hand. This procedure is completed for both the left and the right hand. Multiple actors perform these actions simultaneously in the same test environment to simulate the natural occlusion that occurs in a real shopping store.

Cameras 114 capture images of the actors performing the above actions. In one embodiment, twenty cameras are used in this process. The joints CNNs 112a-112n and the tracking engine 110 process the images to identify joints. The bounding box generator 1504 creates bounding boxes of hand regions in the same way as in production or inference. Instead of classifying these hand regions via the WhatCNN 1506, the images are saved to a storage disk. The stored images are reviewed and labeled. Each image is assigned three labels: the inventory item SKU, the context, and whether the hand is holding something. This process is performed for a large number of images (up to millions of images).

The image files are organized according to data collection scenes. The naming convention for image files identifies the content and context of the images. FIG. 25 shows an image file name in an example embodiment. The first part of the file name, referred to by numeral 2502, identifies the data collection scene and also includes the timestamp of the image. The second part 2504 of the file name identifies the source camera. In the example shown in FIG. 25, the image is captured by "camera 4". The third part 2506 of the file name identifies the frame number from the source camera. In the illustrated example, the file name indicates that this is the 94,600th image frame from camera 4. The fourth part 2508 of the file name identifies the ranges of the x and y coordinates of the region in the source image frame from which this hand region image is taken. In the illustrated example, the region is defined between x coordinate values from pixel 117 to 370 and y coordinate values from pixel 370 to 498. The fifth part 2510 of the file name identifies the person id of the actor in the scene. In the illustrated example, the person in the scene has the id "3". Finally, the sixth part 2512 of the file name identifies the SKU number (item=68) of the inventory item identified in the image.

In the training mode of the WhatCNN 1506, both a forward pass and backpropagation are performed, as opposed to production mode, in which only the forward pass is performed. During training, the WhatCNN generates a classification of the hands of the identified subjects in the forward pass. The output of the WhatCNN is compared with the ground truth. In backpropagation, the gradient of one or more cost functions is calculated. The gradients are then propagated to the convolutional neural network (CNN) and the fully connected (FC) neural network so that the prediction error is reduced, causing the output to be closer to the ground truth. In one embodiment, stochastic gradient descent (SGD) is used for training the WhatCNN 1506.

In one embodiment, 64 images are randomly selected from the training data and augmented. The purpose of image augmentation is to diversify the training data, which leads to better performance of the model. Image augmentation includes random flipping of the image, random rotation, random hue shifts, random Gaussian noise, random contrast changes, and random cropping. The amount of augmentation is a hyperparameter and is tuned through a hyperparameter search. The augmented images are classified by the WhatCNN 1506 during training. The classification is compared with the ground truth, and the coefficients or weights of the WhatCNN 1506 are updated by calculating the gradient of the loss function and multiplying the gradient by a learning rate. The above process is repeated many times (for example, approximately 1000 times) to form an epoch. Between 50 and 200 epochs are performed. During each epoch, the learning rate is slightly decreased, following a cosine annealing schedule.

Training of WhenCNN Model

Training of the WhenCNN 1508 is similar to the training of the WhatCNN 1506 described above, using backpropagation to reduce the prediction error. Actors perform a variety of actions in the training environment. In the example embodiment, the training is performed in a shopping store with shelves stocked with inventory items. Examples of actions performed by the actors include taking an inventory item from a shelf, putting an inventory item back on a shelf, putting an inventory item into a shopping cart (or basket), taking an inventory item back from the shopping cart, swapping an item between the left and right hands, and putting an inventory item into the actor's nook. A nook refers to a location on the actor's body, other than the left and right hands, that can hold an inventory item. Some examples of a nook include an inventory item squeezed between a forearm and upper arm, squeezed between a forearm and the chest, or squeezed between the neck and a shoulder.

The cameras 114 record videos of all the actions described above during training. The videos are reviewed, and all image frames are labeled to indicate the timestamp and the action performed. These labels are referred to as action labels for the respective image frames. The image frames are processed through the multi-CNN pipelines up to the WhatCNN 1506, as described above for production or inference. The output of the WhatCNN, along with the associated action labels, is then used to train the WhenCNN 1508, with the action labels acting as ground truth. Stochastic gradient descent (SGD) with a cosine annealing schedule is used for training, as described above for the training of the WhatCNN 1506.
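The cosine annealing schedule referred to for both the WhatCNN and the WhenCNN can be sketched as a function of the epoch index. The text only names the schedule, so everything else here is an assumption: a single cosine cycle without restarts, one update per epoch, and the hypothetical parameters `lr_max` and `lr_min`.

```python
import math


def cosine_annealed_lr(epoch, total_epochs, lr_max, lr_min=0.0):
    """Learning rate for a 0-based `epoch` on one cosine cycle.

    Assumption: the rate starts at lr_max in the first epoch and reaches
    lr_min in the last epoch; the source does not give these values.
    """
    progress = epoch / max(1, total_epochs - 1)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    total = 100  # within the 50 to 200 epochs mentioned for the WhatCNN
    for epoch in (0, 25, 50, 75, 99):
        print(epoch, cosine_annealed_lr(epoch, total, lr_max=0.1))
```

The curve is flat near the start and the end and steepest in the middle, so the per-epoch decrease stays small early and late in training.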

In addition to the image augmentation used in the training of the WhatCNN, temporal augmentation is also applied to image frames during training of the WhenCNN. Some examples include mirroring, adding Gaussian noise, swapping the logits associated with the left and right hands, shortening the time, shortening the time series by dropping image frames, lengthening the time series by duplicating frames, and dropping data points in the time series to simulate imperfections in the underlying models that generate the input for the WhenCNN. Mirroring includes reversing the time series and the respective labels; for example, a put action becomes a take action when reversed.

Predicting Inventory Events Using Background Image Processing

A system and various implementations for tracking changes by subjects in an area of real space are described with reference to FIGS. 26 to 28B.

System Architecture

FIG. 26 presents a high-level schematic of a system in accordance with an implementation. Because FIG. 26 is an architectural diagram, certain details are omitted to improve the clarity of the description.

The system presented in FIG. 26 receives image frames from a plurality of cameras 114. As described above, in one embodiment, the cameras 114 can be synchronized in time with each other, so that images are captured at the same time, or close in time, and at the same image capture rate. Images captured at the same time, or close in time, by all the cameras covering an area of real space are synchronized in the sense that the synchronized images can be identified in the processing engines as representing different views, at a moment in time, of subjects having fixed positions in the real space.

In one embodiment, the cameras 114 are installed in a shopping store (such as a supermarket) such that sets of cameras (two or more) with overlapping fields of view are positioned over each aisle to capture images of the real space in the store. There are "n" cameras in the real space. Each camera produces a sequence of images of the real space corresponding to its respective field of view.

A subject identification subsystem 2602 (also referred to as the first image processors) processes image frames received from the cameras 114 to identify and track subjects in the real space. The first image processors include subject image recognition engines. The subject image recognition engines receive corresponding sequences of images from the plurality of cameras and process the images to identify subjects represented in the images in the corresponding sequences of images. In one embodiment, the system includes per-camera image recognition engines, as described above, for identifying and tracking multi-joint subjects. Alternative image recognition engines can be used, including examples in which only one "joint" is recognized and tracked per individual, or in which other features or other types of image data over space and time are used to recognize and track subjects in the real space being processed.

A "semantic diffing" subsystem 2604 (also referred to as the second image processors) includes background image recognition engines, which receive corresponding sequences of images from the plurality of cameras and semantically recognize significant differences in the background (that is, inventory display structures such as shelves) as they relate to, for example, puts and takes of inventory items over time in the images from each camera. The second image processors receive the output of the subject identification subsystem 2602 and the image frames from the cameras 114 as input. The second image processors mask the identified subjects in the foreground to generate masked images. The masked images are generated by replacing the bounding boxes that correspond to foreground subjects with background image data. Following this, the background image recognition engines process the masked images to identify and classify background changes represented in the images in the corresponding sequences of images. In one embodiment, the background image recognition engines comprise convolutional neural networks.

Finally, the second image processors process the identified background changes to make a first set of detections of takes of inventory items by identified subjects and of puts of inventory items on inventory display structures by identified subjects. The first set of detections is also referred to as background detections of puts and takes of inventory items. In the example of a shopping store, the first detections identify inventory items taken from the shelves or put on the shelves by customers or employees of the store. The semantic diffing subsystem includes logic to associate identified background changes with identified subjects.

A region proposals subsystem 2606 (also referred to as the third image processors) includes foreground image recognition engines, which receive corresponding sequences of images from the plurality of cameras 114 and semantically recognize significant objects in the foreground (that is, shoppers, their hands, and inventory items) as they relate to, for example, puts and takes of inventory items over time in the images from each camera. The subsystem 2606 also receives the output of the subject identification subsystem 2602. The third image processors process sequences of images from the cameras 114 to identify and classify foreground changes represented in the images in the corresponding sequences of images. The third image processors process the identified foreground changes to make a second set of detections of takes of inventory items by identified subjects and of puts of inventory items on inventory display structures by identified subjects. The second set of detections is also referred to as foreground detections of puts and takes of inventory items. In the example of a shopping store, the second set of detections identifies takes of inventory items and puts of inventory items on inventory display structures by customers and employees of the store.

The system described in FIG. 26 includes a selection logic component 2608 to process the first and second sets of detections to generate log data structures including lists of inventory items for identified subjects. For a take or put in the real space, the selection logic 2608 selects the output from either the semantic diffing subsystem 2604 or the region proposals subsystem 2606. In one embodiment, the selection logic 2608 makes the selection using a confidence score generated by the semantic diffing subsystem for the first set of detections and a confidence score generated by the region proposals subsystem for the second set of detections. The output of the subsystem with the higher confidence score for a particular detection is selected and used to generate a log data structure 1510 (also referred to as a shopping cart data structure) including a list of inventory items associated with the identified foreground subject.

Subsystem Components

FIG. 27 presents subsystem components implementing the system for tracking changes by subjects in an area of real space. The system comprises the plurality of cameras 114, which produce respective sequences of images of corresponding fields of view in the real space. The field of view of each camera overlaps with the field of view of at least one other camera in the plurality of cameras, as described above. In one embodiment, the sequences of image frames corresponding to the images produced by the plurality of cameras 114 are stored in a circular buffer 1502 (also referred to as a ring buffer) per camera 114. Each image frame has a timestamp, an identity of the camera (abbreviated as "camera_id"), and a frame identity (abbreviated as "frame_id"), along with the image data. The circular buffers 1502 store a set of consecutively timestamped image frames from the respective cameras 114. In one embodiment, the cameras 114 are configured to generate synchronized sequences of images.
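A per-camera circular buffer holding timestamped frames can be sketched with a bounded double-ended queue. This is an illustration only: the `Frame` and `CameraRingBuffer` classes, the field types, and the capacity are hypothetical and are not taken from the described system.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Frame:
    timestamp: float
    camera_id: int
    frame_id: int
    image: object  # pixel data; the representation is left open here


class CameraRingBuffer:
    """Keeps only the most recent `capacity` frames from one camera."""

    def __init__(self, capacity):
        # A deque with maxlen silently discards the oldest entry when full.
        self._frames = deque(maxlen=capacity)

    def push(self, frame):
        self._frames.append(frame)

    def latest(self, n):
        """Return up to the `n` most recent frames, oldest first."""
        return list(self._frames)[-n:] if n > 0 else []

    def __len__(self):
        return len(self._frames)


if __name__ == "__main__":
    buffer = CameraRingBuffer(capacity=3)
    for i in range(5):
        buffer.push(Frame(timestamp=i / 30.0, camera_id=4, frame_id=i, image=None))
    print([f.frame_id for f in buffer.latest(2)])
```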

In a preferred implementation, the same cameras and the same sequences of images are used by both the foreground and the background image processors. As a result, redundant detections of puts and takes of inventory items are made using the same input data, allowing for high confidence, and high accuracy, in the resulting data.
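Because the background and foreground processors each yield a detection for the same event, the selection logic 2608 described with FIG. 26 can pick between them by confidence score. The sketch below shows one way that might look. The `Detection` fields, the tie-break in favor of the background detection, and the quantity-based log update are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    subject_id: int
    sku: str
    action: str        # "put" or "take"
    confidence: float
    source: str        # "semantic_diffing" or "region_proposals"


def select_detection(background_det, foreground_det):
    """Return the detection with the higher confidence score.

    Ties go to the background detection; the source does not say how
    ties are resolved.
    """
    if foreground_det.confidence > background_det.confidence:
        return foreground_det
    return background_det


def apply_to_log(log, det):
    """Update a {subject_id: {sku: quantity}} log with a selected detection."""
    items = log.setdefault(det.subject_id, {})
    delta = 1 if det.action == "take" else -1
    items[det.sku] = items.get(det.sku, 0) + delta
    if items[det.sku] <= 0:
        del items[det.sku]
    return log


if __name__ == "__main__":
    bg = Detection(3, "item-68", "take", 0.72, "semantic_diffing")
    fg = Detection(3, "item-68", "take", 0.91, "region_proposals")
    print(apply_to_log({}, select_detection(bg, fg)))
```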

The subject identification subsystem 2602 (also referred to as the first image processors) includes subject image recognition engines, which receive corresponding sequences of images from the plurality of cameras 114. The subject image recognition engines process the images to identify subjects represented in the images in the corresponding sequences of images. In one embodiment, the subject image recognition engines are implemented as convolutional neural networks (CNNs) referred to as joints CNNs 112a-112n. The outputs of the joints CNNs 112a-112n corresponding to cameras with overlapping fields of view are combined to map the locations of joints from the 2D image coordinates of each camera to the 3D coordinates of real space. The joints data structures 800 per subject (j), where j equals 1 to x, identify the locations of the joints of subject (j) in the real space and in the 2D space of each image. Some details of the subject data structure 800 are presented in FIG. 8.

A background image store 2704 in the semantic diffing subsystem 2604 stores masked images (also referred to as background images, in which foreground subjects have been removed by masking) for the corresponding sequences of images from the cameras 114. The background image store 2704 is also referred to as a background buffer. In one embodiment, the size of the masked images is the same as the size of the image frames in the circular buffer 1502. In one embodiment, a masked image is stored in the background image store 2704 for each image frame in the sequence of image frames per camera.

The semantic diffing subsystem 2604 (or the second image processors) includes a mask generator 2724, which produces masks of the foreground subjects represented in the images in the corresponding sequences of images from the cameras. In one embodiment, one mask generator processes the sequence of images per camera. In the example of a shopping store, the foreground subjects are customers or employees of the store in front of the background shelves containing items for sale.

In one embodiment, the joints data structures 800 and the image frames from the circular buffer 1502 are provided as input to the mask generator 2724. The joints data structures identify the locations of the foreground subjects in each image frame. The mask generator 2724 generates a bounding box per foreground subject identified in the image frame. In such an embodiment, the mask generator 2724 uses the x and y coordinates of the joint locations in the 2D image frame to determine the four boundaries of the bounding box. The minimum value of x (from all x values of the joints for a subject) defines the left vertical boundary of the bounding box for the subject. The minimum value of y (from all y values of the joints for a subject) defines the bottom horizontal boundary of the bounding box. Likewise, the maximum values of the x and y coordinates identify the right vertical and top horizontal boundaries of the bounding box. In a second embodiment, the mask generator 2724 produces the bounding boxes for foreground subjects using a convolutional neural network-based person detection and localization algorithm. In such an embodiment, the mask generator 2724 does not use the joints data structures 800 to generate the bounding boxes for foreground subjects.
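The joint-based bounding box computation reduces to taking the minimum and maximum of the joint coordinates. A minimal sketch, assuming the joints of one subject are given as (x, y) pairs in 2D image coordinates; the function name and the return order are illustrative.

```python
def bounding_box(joints):
    """Return (x_min, y_min, x_max, y_max) enclosing all joints of one subject.

    `joints` is an iterable of (x, y) 2D image coordinates.
    """
    points = list(joints)
    if not points:
        raise ValueError("a subject needs at least one joint")
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)


if __name__ == "__main__":
    subject_joints = [(117, 400), (370, 370), (200, 498)]
    print(bounding_box(subject_joints))
```

A practical implementation would probably pad this box by some margin, since joints lie inside the body outline rather than on it; the text does not mention such a margin.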

The semantic diffing subsystem 2604 (or the second image processors) includes mask logic to process images in the sequences of images, replacing foreground image data representing the identified subjects with background image data from the background images for the corresponding sequences of images. This provides the masked images, which serve as new background images for processing. As the circular buffer receives image frames from the cameras 114, the mask logic processes the images in the sequences of images to replace the foreground image data defined by the image masks with background image data. The background image data is taken from the background images for the corresponding sequences of images to generate the corresponding masked images.

Consider the example of a shopping store. Initially, at time t=0, when there are no customers in the store, a background image in the background image store 2704 is the same as its corresponding image frame in the sequence of images per camera. Now consider time t=1, when a customer moves in front of a shelf to buy an item on the shelf. The mask generator 2724 creates a bounding box for the customer and sends it to a mask logic component 2702. The mask logic component 2702 replaces the pixels inside the bounding box in the image frame at t=1 with the corresponding pixels in the background image frame at t=0. This results in a masked image at t=1 corresponding to the image frame at t=1 in the circular buffer 1502. The masked image does not include pixels for the foreground subject (the customer), which are now replaced by pixels from the background image frame at t=0. The masked image at t=1 is stored in the background image store 2704 and acts as the background image for the next image frame, at t=2, in the sequence of images from the corresponding camera.
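The t=0, t=1, t=2 walk-through can be sketched as a loop in which each masked image becomes the background for the next frame. Images are represented here as nested lists of pixel values, and bounding boxes as inclusive (x_min, y_min, x_max, y_max) tuples. Both representations, and the name `mask_frame`, are assumptions for this sketch.

```python
def mask_frame(frame, background, boxes):
    """Copy `frame`, replacing pixels inside each box with background pixels.

    frame, background: lists of rows of pixel values, same size.
    boxes: iterable of (x_min, y_min, x_max, y_max), inclusive, in pixels.
    """
    masked = [row[:] for row in frame]
    for x_min, y_min, x_max, y_max in boxes:
        for y in range(max(0, y_min), min(len(frame), y_max + 1)):
            for x in range(max(0, x_min), min(len(frame[y]), x_max + 1)):
                masked[y][x] = background[y][x]
    return masked


if __name__ == "__main__":
    background = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]      # t=0, empty store
    frame_t1 = [[5, 0, 0], [0, 9, 0], [0, 0, 0]]        # 9 = customer, 5 = shelf change
    background = mask_frame(frame_t1, background, [(1, 1, 1, 1)])
    print(background)  # the masked image at t=1 is the background for t=2
```

Pixels outside the bounding boxes are kept from the current frame, so changes on the shelves accumulate in the background image while the subjects themselves are removed.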

In one embodiment, the mask logic component 2702 combines, such as by averaging or summing by pixel, sets of N masked images in the sequences of images to generate sequences of factored images for each camera. In such an embodiment, the second image processors identify and classify background changes by processing the sequence of factored images. A factored image can be generated, for example, by taking the average value of the pixels in the N masked images in the sequence of masked images per camera. In one embodiment, the value of N is equal to the frame rate of the cameras 114; for example, if the frame rate is 30 FPS (frames per second), the value of N is 30. In such an embodiment, the masked images for a time period of one second are combined to generate a factored image. Taking the average pixel values minimizes the pixel fluctuations caused by sensor noise and luminosity changes in the area of real space.
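Averaging N masked images pixel by pixel can be sketched as follows. For brevity the images are single-channel nested lists, and consecutive non-overlapping groups of N are assumed; the text does not say whether the sets overlap. The function names are illustrative.

```python
def factored_image(masked_images):
    """Per-pixel mean of same-sized masked images (lists of rows of numbers)."""
    n = len(masked_images)
    if n == 0:
        raise ValueError("need at least one masked image")
    height = len(masked_images[0])
    width = len(masked_images[0][0])
    return [
        [sum(img[y][x] for img in masked_images) / n for x in range(width)]
        for y in range(height)
    ]


def factored_sequence(masked_images, n):
    """Average consecutive, non-overlapping groups of `n` masked images.

    A trailing group with fewer than `n` images is dropped.
    """
    return [
        factored_image(masked_images[i:i + n])
        for i in range(0, len(masked_images) - n + 1, n)
    ]


if __name__ == "__main__":
    frames = [[[0.0, 1.0]], [[1.0, 1.0]], [[0.2, 0.4]], [[0.4, 0.4]]]
    print(factored_sequence(frames, n=2))
```

With N equal to the frame rate, as in the 30 FPS example, each factored image summarizes one second of masked images.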

The second image processors identify and classify background changes by processing the sequence of factorized images. Each factorized image in a sequence is compared with the previous factorized image from the same camera by the bit mask calculator 2710. Pairs of factorized images 2706 are provided as input to the bit mask calculator 2710, which generates a bit mask identifying changes in the corresponding pixels of the two factorized images. The bit mask has a value of 1 at each pixel position where the difference between the RGB (red, green, and blue channel) values of the corresponding pixels in the current and previous factorized images is greater than a "difference threshold". The difference threshold is adjustable. In one embodiment, it is set to 0.1.
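A minimal sketch of such a bit mask calculation follows. It assumes RGB values normalized to the range 0 to 1 and uses the Euclidean distance between the two RGB triples as the per-pixel difference; both the normalization and the function name `bit_mask` are assumptions for illustration.

```python
import math
from typing import List, Tuple

Pixel = Tuple[float, float, float]   # RGB, assumed normalized to [0, 1]
Image = List[List[Pixel]]


def bit_mask(previous: Image, current: Image, threshold: float = 0.1) -> List[List[int]]:
    """Mark with 1 every pixel whose RGB difference exceeds the threshold."""
    return [
        [1 if math.dist(p, c) > threshold else 0 for p, c in zip(prev_row, cur_row)]
        for prev_row, cur_row in zip(previous, current)
    ]


previous = [[(0.5, 0.5, 0.5), (0.5, 0.5, 0.5)]]
current = [[(0.5, 0.5, 0.52), (0.9, 0.1, 0.1)]]   # small flicker, then a real change
mask = bit_mask(previous, current)
```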

The bit mask and the pair of factorized images (current and previous) from each camera's sequence of factorized images are provided as input to a background image recognition engine. In one embodiment, the background image recognition engine comprises a convolutional neural network and is referred to as a change CNN 2714a-2714n. A single change CNN processes the sequence of factorized images for each camera. In another embodiment, the masked images from the corresponding sequences of images are not combined. The bit mask is calculated from pairs of masked images. In that embodiment, the pairs of masked images and the bit mask are provided as input to the change CNN.

In this example, the input to the change CNN model consists of seven (7) channels: three image channels (red, green, and blue) for each of the two factorized images and one channel for the bit mask. The change CNN comprises multiple convolutional layers and one or more fully connected (FC) layers. In one embodiment, the change CNN has the same number of convolutional and FC layers as the joint CNNs 112a-112n shown in FIG. 5.
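The seven-channel input can be pictured as a per-pixel concatenation. In the sketch below the channel order (previous RGB, current RGB, bit mask) and the name `stack_channels` are assumptions; the text only fixes the channel count.

```python
from typing import List, Tuple


def stack_channels(previous, current, mask) -> List[List[Tuple[float, ...]]]:
    """Build a 7-channel image: previous RGB + current RGB + bit mask."""
    return [
        [tuple(p) + tuple(c) + (float(m),) for p, c, m in zip(p_row, c_row, m_row)]
        for p_row, c_row, m_row in zip(previous, current, mask)
    ]


previous = [[(0.1, 0.2, 0.3)]]
current = [[(0.4, 0.5, 0.6)]]
mask = [[1]]
stacked = stack_channels(previous, current, mask)   # one pixel with 3 + 3 + 1 channels
```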

The background image recognition engine (change CNN 2714a-2714n) identifies and classifies changes in the factorized images and produces change data structures for the corresponding sequences of images. The change data structures include the coordinates in the masked images of the identified background changes, identifiers of the inventory items that are the subject of the identified background changes, and classifications of the identified background changes. The classifications indicate whether the identified inventory item has been added or removed relative to the background image.

Because items can be taken from or put on the shelves simultaneously by one or more subjects, the change CNN produces a number "B" of overlapping bounding box predictions per output location. A bounding box prediction corresponds to a change in the factorized image. Consider a shopping store with a number "C" of unique inventory items, each identified by a unique SKU. The change CNN predicts the SKU of the inventory item that is the subject of the change. Finally, the change CNN identifies the change (or inventory event type) for every location (pixel) in the output, indicating whether the identified item was taken from the shelf or put on the shelf. These three parts of the output of the change CNN are described by the expression "5 * B + C + 1". Each of the B bounding box predictions comprises five (5) numbers, so B is multiplied by 5. Four of these numbers are the "x" and "y" coordinates of the center of the bounding box and its width and height. The fifth number is the change CNN model's confidence score for the bounding box prediction. "B" is a hyperparameter that can be adjusted to improve the performance of the change CNN model. In one embodiment, B equals 4. Let the width and height (in pixels) of the output of the change CNN be W and H, respectively. The output of the change CNN is then expressed as "W * H * (5 * B + C + 1)". The bounding box output model is based on the object detection system proposed by Redmon and Farhadi in their paper "YOLO9000: Better, Faster, Stronger" (published on December 25, 2016). The paper is available at https://arxiv.org/pdf/1612.08242.pdf.
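To make the output arithmetic concrete, the sketch below computes the per-location depth and splits one location's vector into its three parts. The ordering of the parts within the vector, the helper names, and the example values C = 1000 and W = H = 13 are assumptions for illustration; only B = 4 and the expression itself come from the text.

```python
from typing import List, Sequence, Tuple


def output_depth(num_boxes: int, num_skus: int) -> int:
    """Numbers per output location: 5 per box, one score per SKU, one event value."""
    return 5 * num_boxes + num_skus + 1


def split_location(vector: Sequence[float], num_boxes: int, num_skus: int):
    """Split one location's vector into boxes, SKU scores, and event score."""
    if len(vector) != output_depth(num_boxes, num_skus):
        raise ValueError("vector length does not match 5 * B + C + 1")
    boxes: List[Tuple[float, ...]] = [
        tuple(vector[5 * i: 5 * i + 5]) for i in range(num_boxes)  # x, y, w, h, confidence
    ]
    sku_scores = list(vector[5 * num_boxes: 5 * num_boxes + num_skus])
    event_score = vector[-1]
    return boxes, sku_scores, event_score


depth = output_depth(num_boxes=4, num_skus=1000)   # 5 * 4 + 1000 + 1
total = 13 * 13 * depth                            # W * H * (5 * B + C + 1)

boxes, sku_scores, event_score = split_location([float(i) for i in range(14)], 2, 3)
```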

The outputs of the change CNNs 2714a-2714n corresponding to sequences of images from cameras with overlapping fields of view are combined by the coordination logic component 2718. The coordination logic component processes change data structures from sets of cameras with overlapping fields of view to locate the identified background changes in real space. The coordination logic component 2718 selects bounding boxes, from multiple cameras with overlapping fields of view, that represent inventory items with the same SKU and the same inventory event type (take or put). The selected bounding boxes are then triangulated in 3D real space, using the triangulation techniques described above, to identify the location of the inventory item in 3D real space. The locations of shelves in real space are compared with the triangulated location of the inventory item. False positive predictions are discarded. For example, if the triangulated location of a bounding box does not map to the location of a shelf in real space, the output is discarded. Triangulated locations of bounding boxes in 3D real space that do map to a shelf are treated as true predictions of inventory events.

In one embodiment, the classifications of identified background changes in the change data structures produced by the second image processors indicate whether the identified inventory item has been added or removed relative to the background image. In another embodiment, the classifications indicate whether the identified inventory item has been added or removed relative to the background image, and the system includes logic to associate background changes with identified subjects. The system makes detections of takes of inventory items by the identified subjects and of puts of inventory items on inventory display structures by the identified subjects.

The log generator 2720 implements logic to associate a change identified by a true prediction with an identified subject near the location of the change. In an embodiment that uses a joint recognition engine to identify subjects, the log generator 2720 uses the joint data structures 800 to determine the locations of the subjects' hand joints in 3D real space. A subject is identified whose hand joint falls within a threshold distance of the location of the change at the time of the change. The log generator associates the change with that identified subject.

In one embodiment, as described above, N masked images are combined to generate a factorized image, which is provided as input to the change CNN. Consider that N equals the frame rate (frames per second) of the cameras 114. In such an embodiment, the positions of the subjects' hands during a one-second period are compared with the location of the change to associate changes with identified subjects. If the hand joint positions of more than one subject fall within the threshold distance of the location of a change, the association of the change with a subject is deferred to the output of the foreground image processing subsystem 2606.
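The association rule, including the deferral when several subjects are close to the change, can be sketched as follows. The function name `associate_change`, the dictionary layout, and the sample coordinates and threshold are hypothetical; the sketch only mirrors the decision logic described above.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple

Point3D = Tuple[float, float, float]


def associate_change(
    change_position: Point3D,
    hand_positions: Dict[str, Sequence[Point3D]],
    threshold: float,
) -> Optional[str]:
    """Return the single subject whose hand came within `threshold` of the change.

    `hand_positions` maps a subject id to that subject's hand-joint positions
    observed during the period covered by the factorized image. Returns None
    when no subject, or more than one subject, qualifies; in the latter case
    the decision would be deferred to the foreground subsystem.
    """
    nearby: List[str] = [
        subject_id
        for subject_id, hands in hand_positions.items()
        if any(math.dist(change_position, hand) <= threshold for hand in hands)
    ]
    return nearby[0] if len(nearby) == 1 else None


hands = {"subject-1": [(1.0, 1.0, 1.0)], "subject-2": [(5.0, 5.0, 1.0)]}
owner = associate_change((1.1, 1.0, 1.0), hands, threshold=0.3)
ambiguous = associate_change((3.0, 3.0, 1.0), hands, threshold=10.0)
```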

The foreground image processing (region proposal) subsystem 2606, also referred to as the third image processors, includes a foreground image recognition engine that receives the sequences of images from the plurality of cameras. The third image processors include logic to identify and classify foreground changes represented in the images in the corresponding sequences of images. The region proposal subsystem 2606 produces a second set of detections of takes of inventory items by the identified subjects and of puts of inventory items on inventory display structures by the identified subjects. As shown in FIG. 27, the subsystem 2606 includes the bounding box generator 1504, the WhatCNN 1506, and the WhenCNN 1508. The joint data structures 800 and the image frames per camera from the circular buffer 1502 are provided as input to the bounding box generator 1504. Details of the bounding box generator 1504, the WhatCNN 1506, and the WhenCNN 1508 are presented earlier.

The system described in FIG. 27 includes selection logic to process the first and second sets of detections and produce log data structures that include lists of inventory items for identified subjects. The first set of detections of takes of inventory items by the identified subjects and of puts of inventory items on inventory display structures is produced by the log generator 2720. The first set of detections is determined using the outputs of the second image processors and the joint data structures 800, as described above. The second set of detections of takes and puts is determined using the outputs of the third image processors. For each true inventory event (take or put), the selection logic controller 2608 selects the output from either the second image processors (semantic difference subsystem 2604) or the third image processors (region proposal subsystem 2606). In one embodiment, the selection logic selects the output from the image processor with the higher confidence score for that inventory event.

Process flow of background image semantic differencing

FIGS. 28A and 28B present the detailed steps performed by the semantic difference subsystem 2604 to track changes made by subjects in an area of real space. In the example of a shopping store, the subjects are customers and store employees moving in the aisles between shelves and in other open spaces. The process starts at step 2802. As described above, the cameras 114 are calibrated before the sequences of images from the cameras are processed to identify subjects. Details of camera calibration are presented above. The cameras 114, with overlapping fields of view, capture images of the real space in which subjects are present. In one embodiment, the cameras are configured to generate synchronized sequences of images at a rate of N frames per second. At step 2804, the sequence of images from each camera is stored in a respective circular buffer 1502 per camera. A circular buffer (also called a ring buffer) stores a sequence of images in a sliding window of time. The background image storage 2704 is initialized with an initial image frame, containing no foreground subjects, from each camera's sequence of image frames (step 2806).

As subjects move in front of the shelves, a bounding box per subject is generated using the corresponding joint data structures 800, as described above (step 2808). At step 2810, a masked image is created by replacing the pixels inside the bounding boxes of each image frame with the pixels at the same locations in the background image from the background image storage 2704. The masked image corresponding to each image in each camera's sequence of images is stored in the background image storage 2704. The i-th masked image is used as the background image when replacing pixels in the following (i+1)-th image frame in the camera's sequence of image frames.

At step 2812, N masked images are combined to generate a factorized image. At step 2814, a difference heat map is generated by comparing the pixel values of pairs of factorized images. In one embodiment, the difference between the pixels at a location (x, y) in the 2D space of two factorized images (fi1 and fi2) is calculated in Equation 1 as shown below:
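Equation 1 itself is not legible in this text. Based on the surrounding description (per-channel RGB intensity values and a Euclidean norm), it presumably has the following form; the symbols d, R, G, and B are notation chosen here and are not taken from the original.

```latex
\[
d(x, y) = \sqrt{
  \bigl(R_{fi1}(x, y) - R_{fi2}(x, y)\bigr)^{2}
  + \bigl(G_{fi1}(x, y) - G_{fi2}(x, y)\bigr)^{2}
  + \bigl(B_{fi1}(x, y) - B_{fi2}(x, y)\bigr)^{2}
} \tag{1}
\]
```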

The difference between the pixels at the same x and y location in 2D space is determined using the respective intensity values of the red, green, and blue (RGB) channels, as shown in the equation. The equation gives the magnitude of the difference (also known as the Euclidean norm) between corresponding pixels in the two factorized images.

The difference heat map can contain noise caused by sensor noise and luminosity changes in the area of real space. In FIG. 28B, at step 2816, a bit mask is generated for the difference heat map. Semantically meaningful changes are identified by clusters of 1s (ones) in the bit mask. These clusters correspond to changes that identify inventory items taken from the shelf or put on the shelf. However, noise in the difference heat map can introduce random 1s in the bit mask. In addition, multiple changes (multiple items taken from or put on the shelf) can introduce overlapping clusters of 1s. At the next step of the process flow (2818), image morphology operations are applied to the bit mask. The image morphology operations remove noise (unwanted 1s) and also attempt to separate overlapping clusters of 1s. The result is a cleaner bit mask containing clusters of 1s that correspond to semantically meaningful changes.

Two inputs are given to a morphological operation. The first input is the bit mask and the second input is called a structuring element or kernel. The two basic morphological operations are "erosion" and "dilation". A kernel consists of 1s arranged in a rectangular matrix, in a variety of sizes. Kernels of different shapes (for example, circular, elliptical, or cross-shaped) are created by placing 0s at specific locations in the matrix. Kernels of different shapes are used in image morphology operations to achieve the desired results when cleaning bit masks. In the erosion operation, the kernel slides (or moves) over the bit mask. A pixel in the bit mask (either 1 or 0) is treated as 1 only if all the pixels under the kernel are 1s. Otherwise, it is eroded (changed to 0). Erosion is useful for removing isolated 1s in the bit mask. However, erosion also shrinks the clusters of 1s by eroding their edges.

The dilation operation is the opposite of erosion. In this operation, as the kernel slides over the bit mask, the values of all pixels in the bit mask area overlapped by the kernel are changed to 1 if the value of at least one pixel under the kernel is 1. Dilation is applied to the bit mask after erosion to increase the size of the clusters of 1s. Because the noise is removed during erosion, dilation does not introduce random noise into the bit mask. A combination of erosion and dilation operations is applied to obtain a cleaner bit mask. For example, the following line of computer code applies a 3x3 filter of 1s to the bit mask to perform an "opening" operation, which applies erosion followed by dilation to remove noise and restore the size of the clusters of 1s in the bit mask, as described above. The computer code uses the OpenCV (Open Source Computer Vision) library of programming functions for real-time computer vision applications. The library is available at

A "closing" operation applies dilation followed by erosion. It is useful for closing small holes inside the clusters of 1s. The following code applies a closing operation to the bit mask using a 30x30 cross-shaped filter.
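The code listings referred to above are not present in this text. As a stand-in, the following pure-Python sketch implements erosion, dilation, opening, and closing on a small binary mask so the effect of each operation can be followed by hand. It is not the OpenCV code from the specification: the helper names are hypothetical, pixels outside the mask are treated as 0 (a simplifying assumption that differs from OpenCV's default border handling), and the kernels are kept at 3x3 for readability. With OpenCV, the equivalent calls would typically be `cv2.morphologyEx` with `cv2.MORPH_OPEN` or `cv2.MORPH_CLOSE` and a kernel from `cv2.getStructuringElement`.

```python
from typing import Callable, Iterable, List

Mask = List[List[int]]


def _slide(mask: Mask, kernel: Mask, combine: Callable[[Iterable[int]], bool]) -> Mask:
    """Slide `kernel` over `mask`; out-of-range pixels count as 0 (assumption)."""
    k_h, k_w = len(kernel), len(kernel[0])
    off_y, off_x = k_h // 2, k_w // 2
    height, width = len(mask), len(mask[0])
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            covered = []
            for dy in range(k_h):
                for dx in range(k_w):
                    if not kernel[dy][dx]:
                        continue   # 0s in the kernel shape are ignored
                    yy, xx = y + dy - off_y, x + dx - off_x
                    inside = 0 <= yy < height and 0 <= xx < width
                    covered.append(mask[yy][xx] if inside else 0)
            out[y][x] = 1 if combine(covered) else 0
    return out


def erode(mask: Mask, kernel: Mask) -> Mask:
    return _slide(mask, kernel, all)    # 1 only if every covered pixel is 1


def dilate(mask: Mask, kernel: Mask) -> Mask:
    return _slide(mask, kernel, any)    # 1 if any covered pixel is 1


def opening(mask: Mask, kernel: Mask) -> Mask:
    return dilate(erode(mask, kernel), kernel)   # removes isolated 1s


def closing(mask: Mask, kernel: Mask) -> Mask:
    return erode(dilate(mask, kernel), kernel)   # fills small holes


# A 3x3 cluster of 1s in a 7x7 mask.
clean_block = [[0] * 7 for _ in range(7)]
for row in range(2, 5):
    for col in range(2, 5):
        clean_block[row][col] = 1

noisy = [r[:] for r in clean_block]
noisy[0][0] = 1                                  # isolated noise pixel
opened = opening(noisy, [[1] * 3 for _ in range(3)])

holed = [r[:] for r in clean_block]
holed[3][3] = 0                                  # small hole inside the cluster
closed = closing(holed, [[0, 1, 0], [1, 1, 1], [0, 1, 0]])   # cross-shaped kernel
```

In this toy case both `opened` and `closed` recover the original 3x3 cluster: opening drops the stray pixel, and closing fills the hole.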

The bit mask and the two factorized images (before and after) are provided as input to a convolutional neural network per camera (referred to above as the change CNN). The outputs of the change CNN are the change data structures. At step 2822, the outputs from the change CNNs of cameras with overlapping fields of view are combined using the triangulation techniques described earlier. The location of the change in 3D real space is matched against the locations of shelves. If the location of an inventory event maps to a location on a shelf, the change is considered a true event (step 2824). Otherwise, the change is a false positive and is discarded. True events are associated with a foreground subject. At step 2826, the foreground subject is identified. In one embodiment, the joint data structures 800 are used to determine the locations of hand joints within a threshold distance of the change. If a foreground subject is identified at step 2828, the change is associated with the identified subject at step 2830. If no foreground subject is identified at step 2828, for example because the hand joints of multiple subjects lie within the threshold distance of the change, the redundant detection of the change by the region proposal subsystem is selected at step 2832. The process ends at step 2834.

Training the change CNN

A training data set of seven-channel inputs is created to train the change CNN. One or more subjects acting as customers perform take and put actions by pretending to shop in a shopping store. The subjects move in the aisles, taking inventory items from the shelves and putting items back on the shelves. Images of the actors performing take and put actions are collected in the circular buffer 1502. The images are processed to generate factorized images, as described above. Pairs of factorized images 2706 and the corresponding bit masks output by the bit mask calculator 2710 are reviewed manually to visually identify the changes between the two factorized images. For a factorized image containing a change, a bounding box is drawn manually around the change. This is the smallest bounding box that contains the cluster of 1s corresponding to the change in the bit mask. The SKU number of the inventory item in the change is identified and included in the label for the image, along with the bounding box. The event type, identifying a take or a put of the inventory item, is also included in the label of the bounding box. The label of each bounding box therefore identifies its location on the factorized image, the SKU of the item, and the event type. A factorized image can have more than one bounding box. This process is repeated for every change in all the factorized images collected in the training data set. A pair of factorized images together with the bit mask forms the seven-channel input to the change CNN.

During training of the change CNN, forward passes and backpropagation are performed. In the forward pass, the change CNN identifies and classifies background changes represented in the factorized images in the corresponding sequences of images in the training data set. The change CNN processes the identified background changes to make a first set of detections of takes of inventory items by identified subjects and of puts of inventory items on inventory display structures by identified subjects. During backpropagation, the output of the change CNN is compared with the ground truth, as indicated in the labels of the training data set. Gradients of one or more cost functions are calculated. The gradients are then propagated to the convolutional neural network (CNN) and fully connected (FC) layers so that the prediction error is reduced, bringing the output closer to the ground truth. In one embodiment, a softmax function and a cross-entropy loss function are used to train the change CNN for the class prediction part of the output. The class prediction part of the output includes the SKU identifier of the inventory item and the event type (that is, take or put).
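For readers unfamiliar with the loss named above, a minimal sketch of softmax followed by cross-entropy for a single prediction is shown here. The function names and the sample scores are illustrative assumptions; a real training loop would compute this over batches inside a deep learning framework.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits: Sequence[float], target_index: int) -> float:
    """Negative log-probability assigned to the ground-truth class."""
    return -math.log(softmax(logits)[target_index])


scores = [2.0, 1.0, 0.1]                     # hypothetical class scores
probabilities = softmax(scores)
loss_if_correct = cross_entropy(scores, 0)   # label matches the top score
loss_if_wrong = cross_entropy(scores, 2)     # label is the lowest-scored class
```

The loss is small when the labeled class already has the highest score and grows as the labeled class receives less probability, which is what drives the gradients during backpropagation.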

A second loss function is used to train the change CNN for the prediction of bounding boxes. This loss function calculates the intersection over union (IOU) between the predicted box and the ground-truth box. The area of intersection of the bounding box predicted by the change CNN with the ground-truth bounding box label is divided by the area of the union of the same two bounding boxes. The IOU is high when the overlap between the predicted box and the ground-truth box is large. If more than one predicted bounding box overlaps the ground-truth bounding box, the one with the highest IOU is selected to calculate the loss function. Details of the loss function are presented by Redmon et al. in their paper "You Only Look Once: Unified, Real-Time Object Detection", published on May 9, 2016. The paper is available at https://arxiv.org/pdf/1506.02640.pdf.

Particular implementations

In various embodiments, the system described above for tracking puts and takes of inventory items by subjects in an area of real space also includes one or more of the following features.

1. Region proposals

A region proposal is the frame image of a hand location from all the different cameras covering the person. A region proposal is generated by every camera in the system. It includes empty hands as well as hands carrying store items.

1.1 WhatCNN model

A region proposal can be used as input to image classification using a deep learning algorithm. This classification engine is called the "WhatCNN" model. It is an in-hand classification model: it classifies what is in the hand. In-hand image classification can operate even when parts of the object are occluded by the hand. Smaller items can be occluded by the hand by up to 90%. The region analyzed by the WhatCNN model is intentionally kept small in some embodiments because the analysis is computationally expensive. Each camera can have a dedicated GPU. The analysis is performed for every hand image from every camera for every frame. In addition to the image analysis by the WhatCNN model described above, a confidence weight is assigned to the image (one camera, one point in time). The classification algorithm outputs logits over the complete list of stock keeping units (SKUs), producing the store's product and service identification code list for n items plus one additional entry for an empty hand (n+1).

The scene process now communicates its results back to each video process by sending a key-value dictionary to each video. Here the keys are unique joint IDs and the values are the unique person IDs with which the joints are associated. If no person is associated with a joint, the joint is not included in the dictionary.

Each video process receives the key-value dictionary from the scene process and stores it in a ring buffer that maps frame numbers to the returned dictionaries.
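One way to picture a ring buffer keyed by frame number is a bounded mapping that evicts the oldest frame when full. The class below is a hypothetical sketch built on `collections.OrderedDict`; the class name, capacity, and sample IDs are assumptions, not details from the specification.

```python
from collections import OrderedDict
from typing import Dict, Optional


class FrameRingBuffer:
    """Keep the dictionaries for the most recent `capacity` frame numbers."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._capacity = capacity
        self._items: "OrderedDict[int, Dict]" = OrderedDict()

    def put(self, frame_number: int, value: Dict) -> None:
        self._items[frame_number] = value
        self._items.move_to_end(frame_number)
        while len(self._items) > self._capacity:
            self._items.popitem(last=False)   # evict the oldest frame

    def get(self, frame_number: int) -> Optional[Dict]:
        return self._items.get(frame_number)

    def __len__(self) -> int:
        return len(self._items)


buffer = FrameRingBuffer(capacity=2)
buffer.put(1, {"joint-17": "person-3"})   # joint ID -> person ID
buffer.put(2, {"joint-17": "person-3"})
buffer.put(3, {"joint-21": "person-5"})   # frame 1 is evicted here
```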

Using the returned key-value dictionary, the video process selects, at each moment in time, the subsets of the image that are near hands associated with known people. These regions are numpy slices. Similar slices are also taken around the foreground mask and the raw output feature arrays of the joint CNN. These combined regions are concatenated into a single multidimensional numpy array and stored in a data structure that holds the numpy array together with the person ID the region is associated with and which of the person's hands the region came from.

All proposed regions are then fed into a FIFO queue. This queue takes in regions and pushes their numpy arrays into memory on the GPU.

As arrays arrive on the GPU, they are fed into a CNN dedicated to classification, referred to as the WhatCNN. The output of this CNN is a flat array of floats of size N+1, where N is the number of unique SKUs in the store and the final class represents the nil class (an empty hand). The floats in this array are referred to as logits.
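The layout of that N+1 array can be illustrated with a small helper that picks the highest logit and treats the last index as the empty-hand class. The function name `classify_hand` and the SKU strings are hypothetical; taking a plain argmax is a simplification for illustration, since the text goes on to combine logits across cameras and time before deciding.

```python
from typing import Optional, Sequence


def classify_hand(logits: Sequence[float], sku_ids: Sequence[str]) -> Optional[str]:
    """Return the SKU with the highest logit, or None for the empty-hand class."""
    if len(logits) != len(sku_ids) + 1:
        raise ValueError("expected one logit per SKU plus one for the empty hand")
    best = max(range(len(logits)), key=logits.__getitem__)
    return None if best == len(sku_ids) else sku_ids[best]


sku_ids = ["sku-a", "sku-b"]                           # N = 2 unique SKUs
holding = classify_hand([0.1, 2.0, 0.3], sku_ids)      # second SKU scores highest
empty = classify_hand([0.1, 0.2, 3.0], sku_ids)        # last index wins: empty hand
```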

The results of the WhatCNN are stored back into the region data structures.

All regions for a moment in time are then sent back from each video process to the scene process.

The scene process receives all regions from all videos at a moment in time and stores the results in a key-value dictionary in which each key is a person ID and each value is itself a key-value dictionary, where the key is a camera ID and the value is the logits of a region.
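The nested dictionary described above can be sketched as follows, assuming each incoming region is reduced to a `(person_id, camera_id, logits)` tuple. That tuple layout, the function name `aggregate_regions`, and the sample IDs are assumptions for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Region = Tuple[str, str, List[float]]   # (person_id, camera_id, logits)


def aggregate_regions(regions: Iterable[Region]) -> Dict[str, Dict[str, List[float]]]:
    """Group region logits as person ID -> camera ID -> logits."""
    aggregated: Dict[str, Dict[str, List[float]]] = defaultdict(dict)
    for person_id, camera_id, logits in regions:
        aggregated[person_id][camera_id] = logits
    return {person: dict(cameras) for person, cameras in aggregated.items()}


regions = [
    ("person-3", "cam-1", [0.1, 2.0, 0.3]),
    ("person-3", "cam-2", [0.2, 1.7, 0.4]),
    ("person-5", "cam-1", [0.1, 0.2, 3.0]),
]
by_person = aggregate_regions(regions)
```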

This aggregated data structure is then stored in a ring buffer that maps frame numbers to the aggregated structure for each moment in time.

1.2 WhenCNN model

The images from different cameras processed by the WhatCNN model are combined over a period of time (multiple cameras over a period of time). An additional input to this model is the hand location in 3D space, triangulated from multiple cameras. Another input to this algorithm is the distance of the hand from the store's planogram. In some embodiments, the planogram can be used to identify whether the hand is close to a shelf containing a particular item (for example, Cheerios boxes). Another input to this algorithm is the foot location in the store.

除了使用SKU之物件分類以外，第二分類模型係使用時間序列分析以判定該物件是被拾起自該貨架或者是被放在該貨架上。該些影像在一段時間週期期間被分析以判定其在先前影像框中位於該手中的該物件是已被放回該貨架中或者是已被拾起自該貨架。In addition to using SKUs to classify objects, the second classification model uses time series analysis to determine whether the object was picked up from the shelf or placed on the shelf. The images are analyzed during a period of time to determine whether the item that was in the hand in the previous image frame has been put back into the shelf or picked up from the shelf.

針對一第二時間(每秒30框)週期及三個相機，系統將具有90個類別輸出，針對相同手加信心。此結合影像分析顯著地增加了正確地識別該手中之物件的機率。涵蓋時間的分析係增進了輸出之品質，儘管是各別框之某些極低信心位準的輸出。此步驟可具有(例如)從80%準確度至95%準確度之輸出信心。For a second time (30 frames per second) period and three cameras, the system will have 90 category outputs, and increase confidence for the same hand. This combined with image analysis significantly increases the chances of correctly identifying an object in the hand. Coverage analysis improves the quality of the output, albeit at the output of some very low confidence levels in the respective boxes. This step may have, for example, output confidence from 80% accuracy to 95% accuracy.
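As a rough illustration of why 90 outputs for the same hand help, the sketch below averages per-frame class probabilities. Plain averaging is an assumption made for illustration only; the text describes a learned time series model (the WhenCNN), not this rule, and the logit values are made up.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def combine_outputs(per_frame_logits):
    """Average class probabilities over every frame and camera for one hand."""
    totals = [0.0] * len(per_frame_logits[0])
    for logits in per_frame_logits:
        for index, prob in enumerate(softmax(logits)):
            totals[index] += prob
    return [total / len(per_frame_logits) for total in totals]

# 30 frames x 3 cameras = 90 outputs. Here 70 weakly favour class 1
# and 20 noisy outputs favour class 0 (hypothetical values).
outputs = [[0.0, 0.8, 0.0]] * 70 + [[0.9, 0.0, 0.0]] * 20
mean_probs = combine_outputs(outputs)
best = max(range(len(mean_probs)), key=mean_probs.__getitem__)
print(best)
```

No single frame is decisive on its own, but the pooled evidence still selects one class.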

This model also takes the output of the shelf model as an input, to identify which object the person has picked up.

The scene process waits for 30 or more aggregated structures to accumulate, representing at least one second of real time, and then performs a further analysis to reduce the aggregated structure to a single integer for each person ID-hand pair, where the integer is a unique ID representing a SKU in the store. For a moment in time, this information is stored in a key-value dictionary whose keys are person ID-hand pairs and whose values are SKU integers. This dictionary is stored over time in a ring buffer that maps frame numbers to the dictionary for each moment in time.
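A ring buffer keyed by frame number can be sketched as follows. The class name `FrameRingBuffer`, its capacity, and the example SKU integers are assumptions; the reduction from logits to a single SKU integer is not shown.

```python
from collections import OrderedDict

class FrameRingBuffer:
    """Keep the most recent `capacity` frames, keyed by frame number."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._frames = OrderedDict()

    def put(self, frame_number, value):
        self._frames[frame_number] = value
        self._frames.move_to_end(frame_number)
        while len(self._frames) > self.capacity:
            self._frames.popitem(last=False)  # evict the oldest frame

    def get(self, frame_number):
        return self._frames.get(frame_number)

    def __len__(self):
        return len(self._frames)

buffer = FrameRingBuffer(capacity=3)
for frame in range(5):
    # Keys are (person ID, hand) pairs; values are SKU integers.
    buffer.put(frame, {(7, "left"): 42, (7, "right"): 13})
print(len(buffer), buffer.get(0), buffer.get(4))
```

Bounding the buffer keeps memory use constant while still leaving a window of recent frames for the later time series analysis.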

An additional analysis can then be performed on how this dictionary changes over time, to identify at what moments a person takes something and what it is that they take. This model (the WhenCNN) emits SKU logits as well as logits for each Boolean question: was something taken? was something placed?

The output of the WhenCNN is stored in a ring buffer that maps frame numbers to a key-value dictionary whose keys are person IDs and whose values are the extended logits emitted by the WhenCNN.
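One way to picture the "extended logits" is as the SKU logits followed by the two Boolean logits. That ordering is an assumption made for this sketch, as are the function names and the zero threshold; the text does not specify the layout.

```python
def split_extended_logits(extended):
    """Split one WhenCNN output into SKU logits and the two Boolean logits.

    Assumed layout (not given in the text): the last two entries answer
    "was something taken?" and "was something placed?".
    """
    return extended[:-2], extended[-2], extended[-1]

def is_yes(logit):
    """A logit above 0 corresponds to a sigmoid probability above 0.5."""
    return logit > 0.0

sku_logits, taken, placed = split_extended_logits([0.1, 3.2, -0.4, 0.0, 2.1, -1.7])
print(sku_logits, is_yes(taken), is_yes(placed))
```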

A further collection of heuristics is then run on the stored results of the WhenCNN and the stored joint locations of people, as well as on a precomputed map of the items on the store shelves. This collection of heuristics determines where items are added to or removed from as a result of takes and puts. For each take or put, the heuristics determine whether it was from or to a shelf, from or to a basket, or from or to a person. The output is an inventory for each person, stored as an array in which the value at a SKU's index is the number of units of that SKU the person holds.
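The per-person inventory array can be illustrated with a small sketch. The event format `(kind, sku)` and the function name `apply_events` are hypothetical, and the heuristics that decide where a take or put occurred are not modeled.

```python
def apply_events(inventory, events):
    """Update one person's inventory array from take/put events.

    inventory[sku] is the number of units of that SKU the person holds.
    """
    updated = list(inventory)
    for kind, sku in events:
        if kind == "take":
            updated[sku] += 1
        elif kind == "put" and updated[sku] > 0:
            updated[sku] -= 1
    return updated

# Store with 4 SKUs; the person takes SKU 2 twice, SKU 0 once, then puts one SKU 2 back.
events = [("take", 2), ("take", 2), ("take", 0), ("put", 2)]
inventory = apply_events([0, 0, 0, 0], events)
print(inventory)  # [1, 0, 1, 0]
```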

As a shopper nears the exit of the store, the system can send the inventory list to the shopper's phone. The phone then displays the user's inventory and asks for confirmation to charge the stored credit card. If the user accepts, the credit card is charged. If the system has no credit card on file for the user, the user is asked to provide credit card information.

Alternatively, the shopper may approach an in-store kiosk. The system identifies when the shopper is near the kiosk and sends a message to the kiosk to display the shopper's inventory. The kiosk asks the shopper to accept the charges for the inventory. If the shopper accepts, they may then swipe their credit card or insert cash to pay. FIG. 16 presents an illustration of the WhenCNN model for region proposals.

2. Misplaced items

This feature identifies misplaced items, that is, items that a person has put back on a random shelf. Misplaced items cause problems for object identification because the foot and hand locations relative to the planogram will be incorrect. The system therefore builds up a modified planogram over time. Based on the prior time series analysis, the system is able to determine whether a person has placed an item back on a shelf. The next time an object is picked up from that shelf location, the system knows that there is at least one misplaced item at that hand location. Correspondingly, the algorithm assigns some confidence to the possibility that the person picked up the misplaced item from that shelf. If the misplaced item is picked up from the shelf, the system subtracts the item from that location, and the shelf no longer holds it. The system can also inform a clerk about a misplaced item via an app so that the clerk can move the item to its correct shelf.

3. Semantic diffing (shelf model)

An alternative technique for background image processing uses a background subtraction algorithm to identify changes to the items on the shelves (items removed or placed). It is based on changes at the pixel level. If a person is in front of the shelf, the algorithm pauses so that it does not take into account pixel changes caused by the person's presence. Background subtraction is a noisy process, so a cross-camera analysis is conducted. If enough cameras agree that there is a "semantically meaningful" change on the shelf, the system records that there is a change in that part of the shelf.
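The cross-camera agreement check can be pictured as a simple vote. The `min_agreeing` threshold and the per-camera Boolean votes are assumptions for this sketch; the text says only that "enough" cameras must agree.

```python
def shelf_changed(camera_votes, min_agreeing=2):
    """Record a change only if enough cameras independently report one."""
    agreeing = sum(1 for vote in camera_votes.values() if vote)
    return agreeing >= min_agreeing

# Hypothetical per-camera results for one part of a shelf.
votes = {"cam_a": True, "cam_b": True, "cam_c": False}
print(shelf_changed(votes))
print(shelf_changed({"cam_a": True, "cam_b": False}))
```

Requiring agreement suppresses false changes that a single noisy camera view would otherwise report.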

The next step is to identify whether the change is a "put" or a "take". For this, the time series analysis of the second classification model is used. A region proposal for that particular part of the shelf is generated and passed through the deep learning algorithm. This is easier than in-hand image analysis because the object is not occluded by a hand. A fourth input is given to the algorithm in addition to the three typical RGB inputs; this fourth channel carries the background information. The output of the shelf model, or semantic diffing, is fed back into the second classification model (the time series analysis model).
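The pixel-level comparison behind semantic diffing, in which corresponding pixels are compared by Euclidean distance in RGB space and those above a threshold are marked, can be sketched in plain Python. The function names, the threshold value, and the tiny test images are assumptions; noise filtering and the search for multiple pixel clusters are omitted, so a single bounding box is formed around all marked pixels.

```python
import math

def mark_changed_pixels(before, after, threshold):
    """Mark pixels whose RGB Euclidean distance exceeds the threshold."""
    return [
        [math.dist(p, q) > threshold for p, q in zip(row_before, row_after)]
        for row_before, row_after in zip(before, after)
    ]

def bounding_box(mask):
    """Return (top, left, bottom, right) around all marked pixels, or None."""
    coords = [(r, c) for r, row in enumerate(mask) for c, hit in enumerate(row) if hit]
    if not coords:
        return None
    rows = [r for r, _ in coords]
    cols = [c for _, c in coords]
    return min(rows), min(cols), max(rows), max(cols)

# Two 4x4 frames from the same camera; two pixels change from grey to red.
grey, red = (120, 120, 120), (200, 30, 30)
before = [[grey] * 4 for _ in range(4)]
after = [row[:] for row in before]
after[1][2] = red
after[2][2] = red
mask = mark_changed_pixels(before, after, threshold=50.0)
box = bounding_box(mask)
print(box)  # (1, 2, 2, 2)
```

The box coordinates would then be used to crop the same area from both frames for classification.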

Semantic diffing in this approach includes the following steps:
1. Images from a camera are compared with earlier images from the same camera.
2. Each pair of corresponding pixels in the two images is compared via the Euclidean distance in RGB space.
3. Distances above a certain threshold are marked, resulting in a new image containing only the marked pixels.
4. A collection of image morphology filters is used to remove noise from the marked image.
5. We then search for large collections of marked pixels and form bounding boxes around them.
6. For each bounding box, we then look at the original pixels in the two images to obtain two image snapshots.
7. These two image snapshots are then pushed into a CNN trained to classify whether the image region represents an item being taken or an item being placed, and what the item is.

3. Store audit

An inventory of each shelf is maintained by the system. It is updated as items are picked up by customers. At any point in time, the system is able to generate an audit of the store inventory.

4. Multiple items in hand

Different images are used for multiple items. Two items in the hand are treated differently from one. Some algorithms can predict only a single item, not multiple units of an item. The CNNs are therefore trained so that the algorithms for a quantity of "two" can be executed separately from those for a single item in the hand.

5. Data collection system

Predefined shopping scripts are used to collect good-quality image data. These images are used to train the algorithms.

5.1 Shopping scripts
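A shopping script of this kind, built by randomly sampling actions such as taking, placing, or holding an item, might be generated as in the sketch below. The item names, the hold-duration range, and the function name `generate_script` are hypothetical.

```python
import random

ACTIONS = ("take", "place", "hold")

def generate_script(items, length, seed=None):
    """Randomly sample a sequence of actions for a human actor."""
    rng = random.Random(seed)
    script = []
    for _ in range(length):
        action = rng.choice(ACTIONS)
        step = {"action": action, "item": rng.choice(items)}
        if action == "hold":
            step["seconds"] = rng.randint(1, 10)  # hold duration Y; range assumed
        script.append(step)
    return script

script = generate_script(["cereal", "soda", "soap"], length=5, seed=0)
for step in script:
    print(step)
```

Because the script is stored alongside the recorded video, each step can later serve as a label for the corresponding footage; fixing the seed makes a script reproducible.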

Data collection includes the following steps:
1. A script is automatically generated that tells a human actor which actions to take.
2. These actions are randomly sampled from a collection of actions including: take item X, place item X, hold item X for Y seconds.
3. While performing these actions, the actors move and orient themselves in as many ways as possible while still succeeding at the given action.
4. During the sequence of actions, a collection of cameras records the actors from many perspectives.
5. After the actors have finished the script, the camera videos are bundled together and saved along with the original script.
6. The script serves as the input label for machine learning models (such as the CNNs) that are trained on the videos of the actors.

6. Product line

The system and parts thereof can be used for cashier-less checkout, supported by the following apps.

6.1 Store App

The Store App has several main capabilities: providing data analytics visualizations, supporting loss prevention, and providing a platform to assist customers by showing the retailer where people are in the store and what merchandise they have collected. Permission levels and app access for employees can be set at the retailer's discretion.

6.1.1 Standard Analytics

Data is collected by the platform and can be used in a variety of ways.
1. The derivative data is used to perform various kinds of analytics on stores, the shopping experiences they provide, and customer interactions with products, the environment, and other people.
   a. The data is stored and used in the background to analyze stores and customer interactions. The Store App displays some visualizations of this data to retailers. Other data is stored and queried when a particular data point is wanted.
2. Heat maps: The platform visualizes a retailer's floor plan, shelf layouts, and other store environments with overlays showing levels of various kinds of activity. Examples:
   1. Maps of places that people walk past without handling any of the products.
   2. Maps of where people stand when interacting with products.
3. Misplaced items: The platform tracks all of a store's SKUs. When an item is put in an incorrect place, the platform knows where the item is and builds a log. At some threshold, or immediately, store employees may be alerted to the misplaced item. Alternatively, staff may access the Misplaced Item Map in the Store App. When convenient, staff can then quickly locate and correct misplaced items.

6.1.2 Standard Assist
• The Store App displays the store's floor plan.
• It displays a graphic to represent each person in the store.
• When the graphic is selected (via touch, click, or other means), information pertinent to store employees is displayed. For example, Shopping Cart items (the items the person has collected) appear in a list.
• If, for a particular item in a person's possession (Shopping Cart) and for a period of time, the platform's confidence level is below a predetermined threshold, the person's graphic (currently a dot) indicates the difference. The app uses a color change: green indicates high confidence, and yellow/orange indicates lower confidence.
• Store employees with the Store App can be notified of the lower confidence. They can then go and make sure the customer's Shopping Cart is accurate.
• Through the Store App, the retailer's employees can adjust a customer's Shopping Cart items (add or delete).

6.1.3 Standard LP
• A shopper who is using the Shopper App simply exits the store and is charged. A shopper who is not must use the Guest App to pay for the items in their Shopping Cart.
• If the shopper bypasses the Guest App on their way out of the store, their graphic indicates that they must be approached before exiting. The app changes the color to red. Staff also receive a notification of the potential loss.
• Through the Store App, the retailer's employees can adjust a customer's Shopping Cart items (add or delete).

6.2 Non-Store App

The following analytics features represent additional capabilities of the platform.

6.2.1 Standard Analytics
1. Product interactions: A granular breakdown of product interactions, such as:
   a. Interaction time and conversion ratio for each product.
   b. A/B comparisons (color, style, and so on). Some of the smaller products on display come in multiple options such as colors and flavors.
      • Is the rose gold handled more than the silver?
      • Do blue cans attract more interactions than red ones?
2. Directional impressions: Know the difference between a location-based impression and where the shopper's gaze is. If a shopper looks for 20 seconds at a product that is 15 feet away, the impression should count for where they are looking, not for where they are standing.
3. Customer recognition: Remember repeat shoppers, their associated email addresses (collected in a variety of ways by the retailer), and their shopping profiles.
4. Group dynamics: Determine when a shopper is watching someone else interact with a product.
   • Does that person interact with the product afterwards?
   • Did those people enter the store together, or are they likely strangers?
   • Do individuals or groups of people spend more time in the store?
5. Customer touchback: Offer customers targeted information after their store experience. This feature may have a slightly different implementation for each retailer, depending on particular practices and policies. It may require integration and/or development by the retailer to adopt the feature.
   • Shoppers are asked whether they wish to receive notifications about products they might be interested in. This step may be integrated with the store's method of collecting email addresses.
   • After leaving the store, a customer may receive an email listing the products they spent time with in the store. An interaction threshold for duration, touch, and sight (directional impressions) will be decided. When the threshold is met, the products are added to the customer's list and sent to them soon after they leave the store.

Additionally, or alternatively, the shopper could be sent an email some time later offering discounted products or other special information. These products will be items in which they expressed interest but which they did not purchase.

6.3 Guest App

The Shopper App automatically checks people out when they exit the store. However, the platform does not require shoppers to have or use the Shopper App in order to use the store.

A shopper who does not have or use the Shopper App walks up to a kiosk (an iPad, tablet, or other screen) or to a pre-installed self-checkout machine. The display, which is integrated with the platform, automatically shows the customer's Shopping Cart.

The shopper has the opportunity to review what is displayed. If they agree with the information on the display, they can either insert cash into the machine (if that capability is built into the hardware, for example a self-checkout machine) or swipe their credit or debit card. They can then exit the store.

If they disagree with the display, they can challenge it through a touch screen, button, or other means, and store staff are notified. (See Standard Assist under the Store App.)

6.4 Shopper App

Through the use of an app, the Shopper App, the customer can exit the store with merchandise, be charged automatically, and receive a digital receipt. The shopper must open the app at some point while within the store's shopping area. The platform recognizes a unique image displayed on the shopper's device. The platform ties the shopper to their account (Customer Association) and, regardless of whether they keep the app open, remembers who they are for the entire time they are in the store's shopping area.

As the shopper gathers items, the Shopper App displays the items in the shopper's Shopping Cart. If the shopper wishes, they can view product information about each item they pick up (that is, each item added to their Shopping Cart). Product information is either stored in the store's systems or added to the platform. The ability to update that information, such as offering product discounts or displaying prices, is an option the retailer can request, purchase, or develop.

When a shopper puts an item down, it is removed from their Shopping Cart on the backend and in the Shopper App.

If the Shopper App is opened and then closed after Customer Association is completed, the platform maintains the shopper's Shopping Cart and charges them correctly once they exit the store.

The Shopper App also has mapping information on its development roadmap. It can tell a customer where to find items in the store if the customer requests the information by typing in the item being sought. At a later date, we will take the shopper's shopping list (entered into the app manually or through other intelligent systems) and display the fastest route through the store for collecting all the desired items. Other filters, such as "Bagging Preference", may be added. The Bagging Preference filter allows a shopper not to follow the fastest route but instead to gather sturdier items first and more fragile items later.

7. Types of customers

Member customer: The first type of customer logs into the system using an app. The customer is prompted with a picture, and when they press it, the system links it to the customer's internal ID. If the customer has an account, the account is charged automatically when the customer walks out of the store. This is the membership-based store.

Guest customer: Not every store will have a membership program, and some customers may not have a smartphone or a credit card. This type of customer walks up to a kiosk. The kiosk displays the items the customer has and asks the customer to insert money. The kiosk already knows about all the items the customer has bought. For this type of customer, the system is able to identify whether the customer has not paid for items in the shopping cart and to alert the cashier at the door about the unpaid items before the customer gets there. The system can also raise an alert for a single item that has not been paid for, or for a single item about which the system has low confidence. This is referred to as predictive pathfinding.

The system assigns color codes (green or yellow) to the customers walking in the store based on the confidence level. Green-coded customers are either logged into the system or are customers about whom the system has high confidence. Yellow-coded customers have one or more items that have not been predicted with high confidence. A clerk can look at the yellow dots and press them to identify the problem items, then walk up to the customer and resolve the problem.

8. Analytics

A host of analytics information is gathered about the customer, such as how much time the customer spent in front of a particular shelf. In addition, the system tracks where the customer is looking (an impression, in the system's terms) and which items the customer picked up and put back on the shelf. Such analytics are currently available in e-commerce but not yet in retail stores.

9. Functional modules

The following is a list of functional modules:
1. A system for capturing arrays of images in the store using synchronized cameras.
2. A system for identifying joints in images, and the sets of joints of individual persons.
3. A system for creating new persons using joint sets.
4. A system for deleting ghost persons using joint sets.
5. A system for tracking individual persons over time by tracking joint sets.
6. A system for generating region proposals for each person present in the store, indicating the SKU number of the item in the hand (WhatCNN).
7. A system for performing get/put analysis on region proposals, indicating whether the item in the hand was picked up or placed on the shelf (WhenCNN).
8. A system for generating a per-person inventory array using region proposals and get/put analysis (outputs of the WhenCNN combined with heuristics, stored joint locations of persons, and a precomputed map of items on the store shelves).
9. A system for identifying, tracking, and updating the locations of misplaced items on shelves.
10. A system for tracking changes (get/put) to items on shelves using pixel-based analysis.
11. A system for performing an inventory audit of the store.
12. A system for identifying multiple items in hands.
13. A system for collecting item image data from the store using shopping scripts.
14. A system for performing checkout and collecting payment from member customers.
15. A system for performing checkout and collecting payment from guest customers.
16. A system for performing loss prevention by identifying unpaid items in a cart.
17. A system for tracking customers using color codes to help clerks identify incorrectly identified items in a customer's cart.
18. A system for generating customer shopping analytics, including location-based impressions, directional impressions, A/B analysis, customer recognition, group dynamics, and so on.
19. A system for generating targeted customer touchback using shopping analytics.
20. A system for generating heat map overlays of the store to visualize different activities.

The technology described herein can support cashier-less checkout. Go to the store. Take things. Leave.

Cashier-less checkout is a system based purely on machine vision and deep learning. Shoppers skip the line and get what they want faster and more easily. No RFID tags. No changes to the store's backend systems. Can be integrated with 3rd-party Point of Sale and Inventory Management systems.
• Real-time 30 FPS analysis of every video feed.
• On-premise, cutting-edge GPU cluster.
• Recognizes shoppers and the items they interact with.
• No internet dependencies in the example embodiment.
• Multiple state-of-the-art deep learning models, including proprietary custom algorithms, that resolve gaps in machine vision technology for the first time.

Techniques & Capabilities include the following:
1. Standard Cognition's machine learning pipeline solves:
   a) People detection.
   b) Entity tracking.
   c) Multi-camera person agreement.
   d) Hand detection.
   e) Item classification.
   f) Item ownership resolution.

Combining these techniques, we can:
1. Keep track of all people throughout their shopping experience in real time.
2. Know what shoppers have in their hands, where they stand, and what they put back.
3. Know which direction shoppers are facing and for how long.
4. Recognize misplaced items and perform 24/7 visual merchandising audits.

Can detect exactly what a shopper has in their hand and in their basket.

Learning your store:

Custom neural networks are trained on specific stores and items. Training data is reusable across all store locations.

Standard deployment:

Ceiling cameras must be installed with double coverage of all areas of the store. A typical aisle requires between 2 and 6 cameras.

The on-premise GPU cluster can fit into one or two server racks in a back office.

Example systems can be integrated with, or can include, Point of Sale and Inventory Management systems.

A first system, method, and computer program product for capturing arrays of images in a store using synchronized cameras.

A second system, method, and computer program product for identifying joints in images, and the sets of joints of individual persons.

A third system, method, and computer program product for creating new persons using joint sets.

A fourth system, method, and computer program product for deleting ghost persons using joint sets.

A fifth system, method, and computer program product for tracking individual persons over time by tracking joint sets.

A sixth system, method, and computer program product for generating region proposals for each person present in the store, indicating the SKU number of the item in the hand (WhatCNN).

A seventh system, method, and computer program product for performing get/put analysis on region proposals, indicating whether the item in the hand was picked up or placed on the shelf (WhenCNN).

An eighth system, method, and computer program product for generating a per-person inventory array using region proposals and get/put analysis (for example, outputs of the WhenCNN combined with heuristics, stored joint locations of persons, and a precomputed map of items on the store shelves).

A ninth system, method, and computer program product for identifying, tracking, and updating the locations of misplaced items on shelves.

A tenth system, method, and computer program product for tracking changes (get/put) to items on shelves using pixel-based analysis.

An eleventh system, method, and computer program product for performing an inventory audit of a store.

A twelfth system, method, and computer program product for identifying multiple items in hands.

A thirteenth system, method, and computer program product for collecting item image data from a store using shopping scripts.

A fourteenth system, method, and computer program product for performing checkout and collecting payment from member customers.

A fifteenth system, method, and computer program product for performing checkout and collecting payment from guest customers.

A sixteenth system, method, and computer program product for performing loss prevention by identifying unpaid items in a cart.

A seventeenth system, method, and computer program product for tracking customers using, for example, color codes to help clerks identify incorrectly identified items in a customer's cart.

An eighteenth system, method, and computer program product for generating customer shopping analytics, including one or more of location-based impressions, directional impressions, A/B analysis, customer recognition, group dynamics, and so on.

A nineteenth system, method, and computer program product for generating targeted customer touchback using shopping analytics.

A twentieth system, method, and computer program product for generating heat map overlays of a store to visualize different activities.

A twenty-first system, method, and computer program for hand detection.

A twenty-second system, method, and computer program for item classification.

A twenty-third system, method, and computer program for item ownership resolution.

用於項目人檢測之第二十四系統、方法及電腦程式。Twenty-fourth system, method and computer program for project person detection.

用於項目單體追蹤之第二十五系統、方法及電腦程式。Twenty-fifth system, method and computer program for tracking individual items.

用於項目多數相機個人同意之第二十六方法及電腦程式。Twenty-sixth method and computer program for personal consent of most cameras.

實質上如文中所述之用於無出納員結帳的第二十七系統、方法及電腦程式產品。The twenty-seventh system, method, and computer program product for essentially a cashier-free checkout as described herein.

系統1-26之任一者與任何其他系統或以上列出的系統1-26中之系統的組合。Any one of systems 1-26 in combination with any other system or systems in systems 1-26 listed above.

Described herein is a method for tracking puts and takes of inventory items by subjects in an area of real space, comprising:

using a plurality of cameras to produce respective sequences of images of corresponding fields of view in the real space, the field of view of each camera overlapping with the field of view of at least one other camera in the plurality of cameras;

receiving the sequences of images from the plurality of cameras, and using first image recognition engines to process the images to generate first data sets that identify subjects and the locations of the identified subjects in the real space;

processing the first data sets to specify bounding boxes that include images of hands of the identified subjects in images in the sequences of images;

receiving the sequences of images from the plurality of cameras, and processing the specified bounding boxes in the images to generate classifications of the hands of the identified subjects using second image recognition engines, the classifications including whether the identified subject is holding an inventory item, a first proximity classification indicating the location of a hand of the identified subject relative to a shelf, a second proximity classification indicating the location of a hand of the identified subject relative to the body of the identified subject, a third proximity classification indicating the location of a hand of the identified subject relative to a basket associated with the identified subject, and an identifier of a likely inventory item; and

processing the classifications of hands for sets of images in the sequences of images of the identified subjects to detect takes of inventory items by the identified subjects and puts of inventory items on inventory display structures by the identified subjects.
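The final step above can be pictured as a scan over a time series of per-frame hand classifications. The following is a minimal sketch only: the field names (`holding`, `near_shelf`, `near_body`, `near_basket`, `item_id`) and the transition rule are assumptions made for illustration, and elsewhere the specification attributes this analysis to a trained network (WhenCNN) rather than to a hand-written rule.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class HandClassification:
    """Per-frame classification of one hand (field names are hypothetical)."""
    holding: bool            # is the hand holding an inventory item?
    near_shelf: bool         # first proximity classification
    near_body: bool          # second proximity classification
    near_basket: bool        # third proximity classification
    item_id: Optional[str]   # identifier of the likely inventory item


def detect_events(frames: List[HandClassification]) -> List[Tuple[int, str, str]]:
    """Return (frame_index, "take" | "put", item_id) events.

    Assumed rule: a change of the `holding` flag while the hand is near
    the shelf (in the earlier or the later frame) marks an event.
    """
    events = []
    for i in range(1, len(frames)):
        prev, cur = frames[i - 1], frames[i]
        if not (prev.near_shelf or cur.near_shelf):
            continue
        if not prev.holding and cur.holding and cur.item_id:
            events.append((i, "take", cur.item_id))
        elif prev.holding and not cur.holding and prev.item_id:
            events.append((i, "put", prev.item_id))
    return events
```

In this sketch a hand that arrives at the shelf empty and leaves it holding an item yields a "take", and the reverse transition yields a "put"; frames in which the hand is away from the shelf produce no events.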

In the method described here, the first data sets can comprise, for each identified subject, sets of candidate joints having coordinates in real space.

In the method described here, processing the first data sets to specify bounding boxes can include specifying the bounding boxes based on the locations of joints in the sets of candidate joints for each subject.
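One way to specify a bounding box from a joint location is to center a fixed-size square on the hand joint's pixel coordinates. The sketch below assumes this approach; the function name, the 128-pixel default, and the choice to shift rather than shrink a box at the image border are illustrative assumptions, not values taken from this section.

```python
from typing import Tuple


def hand_bounding_box(
    hand_xy: Tuple[int, int],
    image_size: Tuple[int, int],
    box_size: int = 128,  # assumed size, chosen only for illustration
) -> Tuple[int, int, int, int]:
    """Return (left, top, right, bottom) of a square box centered on a hand joint.

    The box is shifted, not shrunk, when it would cross the image border,
    so every crop has the same size (assumes box_size <= image dimensions).
    """
    x, y = hand_xy
    width, height = image_size
    half = box_size // 2
    left = min(max(x - half, 0), width - box_size)
    top = min(max(y - half, 0), height - box_size)
    return left, top, left + box_size, top + box_size
```

Keeping every crop the same size is convenient when the crops are later fed to a network with a fixed input shape.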

In the method described here, one or both of the first and second image recognition engines can comprise convolutional neural networks.

The method described here can include processing the classifications of the bounding boxes using convolutional neural networks.

A computer program product (and products) is described, which includes a computer-readable memory comprising a non-transitory data storage medium, and computer instructions stored in the memory that are executable by a computer to track puts and takes of inventory items by subjects in an area of real space by any of the processes described herein.

A system is described comprising a plurality of cameras producing sequences of images including a hand of a subject; and a processing system coupled to the plurality of cameras, the processing system including a hand image recognition engine that receives the sequences of images to generate classifications of the hand in time sequence, and logic to process the classifications of the hand from the sequences of images to identify an action by the subject, wherein the action is one of a put and a take of an inventory item.

The system can include logic to identify locations of joints of the subject in the images in the sequences of images, and to identify, based on the identified joints, bounding boxes in the corresponding images that include the hands of the subject.

A computer program listing appendix accompanies this specification and includes portions of an example computer program that implements certain parts of the system provided in this application. The appendix includes examples of heuristics to identify joints of subjects and inventory items. The appendix presents computer program code to update a subject's shopping cart data structure. The appendix also includes a computer program routine to calculate the learning rate during training of a convolutional neural network. The appendix includes a computer program routine to store the classification results of subjects' hands from a convolutional neural network in a data structure per hand, per subject, per image frame from each camera.

100‧‧‧System

101a, 101b, 101c‧‧‧Network nodes

102‧‧‧Network node

110‧‧‧Tracking engine

112a-112n‧‧‧Image recognition engines

114‧‧‧Camera

116a‧‧‧Aisle

116b‧‧‧Aisle

116n‧‧‧Aisle

120‧‧‧Calibrator

140‧‧‧Subject database

150‧‧‧Training database

160‧‧‧Heuristics database

170‧‧‧Calibration database

181‧‧‧Network

202‧‧‧Shelf A

204‧‧‧Shelf B

206‧‧‧Camera A

208‧‧‧Camera B

216‧‧‧Field of view

218‧‧‧Field of view

220‧‧‧Floor

230‧‧‧Roof

412-418‧‧‧Cameras

422-428‧‧‧Ethernet-based connectors

430‧‧‧Storage subsystem

432‧‧‧Host memory subsystem

434‧‧‧Random access memory (RAM)

436‧‧‧Read-only memory (ROM)

440‧‧‧File storage subsystem

442‧‧‧RAID 0

444‧‧‧Solid state disk (SSD)

446‧‧‧Hard disk drive (HDD)

450‧‧‧Processor subsystem

454‧‧‧Bus subsystem

462‧‧‧GPU 1

464‧‧‧GPU 2

466‧‧‧GPU 3

481‧‧‧Network

510‧‧‧Input image

520‧‧‧Filter

530‧‧‧Convolutional layer

540‧‧‧Output matrix

702‧‧‧Global metric calculator

800‧‧‧Subject data structure

1411‧‧‧Video process

1415‧‧‧Scene process

1452‧‧‧Output

1453‧‧‧Input

1457‧‧‧Output

1502‧‧‧Circular buffer

1504‧‧‧Bounding box generator

1506‧‧‧WhatCNN

1508‧‧‧WhenCNN

1510‧‧‧Shopping cart data structure

1520‧‧‧Image channel

1522‧‧‧Coordination logic module

2002‧‧‧Heuristics

2111‧‧‧Input

2113‧‧‧First convolutional layer

2115‧‧‧Second convolutional layer

2117‧‧‧Box

2119, 2121‧‧‧Layers

2123‧‧‧Eight convolutional layers

2125, 2127‧‧‧Convolutional layers

2129‧‧‧Eight convolutional layers

2135‧‧‧Fully connected layer

2210‧‧‧"Single hand" model

2212‧‧‧Convolutional layer

2214‧‧‧Pooling layer

2216‧‧‧Block 0

2218‧‧‧Block 1

2220‧‧‧Block 2

2222‧‧‧Block 3

2310‧‧‧Box

2312‧‧‧Convolutional layer

2314‧‧‧Batch normalization layer

2316‧‧‧ReLU nonlinearity

2318‧‧‧conv1

2320‧‧‧conv2

2322‧‧‧conv3

2324‧‧‧Addition operation

2326‧‧‧Skip connection

2410‧‧‧Fully connected layer (FC)

2412‧‧‧Reshape operator

2420‧‧‧Next layer

2422‧‧‧MatMul

2424‧‧‧Operator

2426‧‧‧Output

2502‧‧‧First part

2504‧‧‧Second part

2506‧‧‧Third part

2508‧‧‧Fourth part

2510‧‧‧Fifth part

2512‧‧‧Sixth part

2602‧‧‧First image processor subsystem

2604‧‧‧Second image processor subsystem

2606‧‧‧Third image processor subsystem

2608‧‧‧Selection logic component

2702‧‧‧Mask logic component

2704‧‧‧Background image store

2706‧‧‧Factored image

2710‧‧‧Bit mask calculator

2714a-2714n‧‧‧Change CNNs

2718‧‧‧Coordination logic component

2720‧‧‧Log generator

2724‧‧‧Mask generator

FIG. 1 illustrates an architectural-level schematic of a system in which a tracking engine tracks subjects using joint data generated by image recognition engines.

FIG. 2 is a side view of an aisle in a shopping store illustrating a camera arrangement.

FIG. 3 is a top view of the aisle of FIG. 2 in a shopping store illustrating a camera arrangement.

FIG. 4 is a camera and computer hardware arrangement configured to host an image recognition engine of FIG. 1.

FIG. 5 illustrates a convolutional neural network, showing the identification of joints in an image recognition engine of FIG. 1.

FIG. 6 shows an example data structure for storing joint information.

FIG. 7 illustrates the tracking engine of FIG. 1 with a global metric calculator.

FIG. 8 shows an example data structure for storing a subject, including information about the associated joints.

FIG. 9 is a flowchart illustrating process steps for tracking subjects with the system of FIG. 1.

FIG. 10 is a flowchart showing more detailed process steps for the camera calibration step of FIG. 9.

FIG. 11 is a flowchart showing more detailed process steps for the video process step of FIG. 9.

FIG. 12A is a flowchart showing a first part of more detailed process steps for the scene process of FIG. 9.

FIG. 12B is a flowchart showing a second part of more detailed process steps for the scene process of FIG. 9.

FIG. 13 is an illustration of an environment in which an embodiment of the system of FIG. 1 is used.

FIG. 14 is an illustration of the video and scene processes in an embodiment of the system of FIG. 1.

FIG. 15A is a schematic showing a pipeline with multiple convolutional neural networks (CNNs), including a joints CNN, WhatCNN, and WhenCNN, to generate a shopping cart data structure per subject in the real space.

FIG. 15B shows multiple image channels from multiple cameras and the coordination logic for the subjects and their respective shopping cart data structures.

FIG. 16 is a flowchart illustrating process steps for identifying and updating subjects in the real space.

FIG. 17 is a flowchart showing process steps for processing hand joints of subjects to identify inventory items.

FIG. 18 is a flowchart showing process steps for time series analysis of inventory items per hand joint to create a shopping cart data structure per subject.

FIG. 19 is an illustration of a WhatCNN model in an embodiment of the system of FIG. 15A.

FIG. 20 is an illustration of a WhenCNN model in an embodiment of the system of FIG. 15A.

FIG. 21 presents an example architecture of a WhatCNN model, identifying the dimensionality of the convolutional layers.

FIG. 22 presents a high-level block diagram of an embodiment of a WhatCNN model for classification of hand images.

FIG. 23 presents details of the first block of the high-level block diagram of the WhatCNN model presented in FIG. 22.

FIG. 24 presents the operators in a fully connected layer in the example WhatCNN model presented in FIG. 22.

FIG. 25 is an example name of an image file stored as part of the training data set for a WhatCNN model.

FIG. 26 is a high-level architecture of a system for tracking changes by subjects in an area of real space, in which selection logic selects between a first detection using background semantic diffing and a redundant detection using foreground region proposals.

FIG. 27 presents components of the subsystems implementing the system of FIG. 26.

FIG. 28A is a flowchart showing a first part of detailed process steps for determining inventory events and generating a shopping cart data structure.

FIG. 28B is a flowchart showing a second part of detailed process steps for determining inventory events and generating a shopping cart data structure.

Claims

A system for tracking multi-joint subjects in an area of real space, comprising: a plurality of cameras, cameras in the plurality of cameras producing respective sequences of images of corresponding fields of view in the real space, the field of view of each camera overlapping with the field of view of at least one other camera in the plurality of cameras; a processing system coupled to the plurality of cameras, the processing system including: image recognition engines, receiving the sequences of images from the plurality of cameras, which process the images to generate corresponding arrays of joint data structures, the arrays of joint data structures corresponding to particular images classifying elements of the particular images by joint type, time of the particular image, and coordinates of the element in the particular image; a tracking engine configured to receive the arrays of joint data structures corresponding to images in the sequences of images from cameras having overlapping fields of view, and to translate the coordinates of the elements in the arrays of joint data structures corresponding to images in different sequences into candidate joints having coordinates in the real space; and logic to identify sets of candidate joints having coordinates in the real space as multi-joint subjects in the real space.
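The claim does not state how image coordinates from different cameras are translated into real-space coordinates. As a hedged illustration only, the sketch below assumes calibrated cameras for which each pixel has already been converted into a viewing ray (a camera position plus a direction), and estimates the joint location as the midpoint of the shortest segment between two such rays. The function name and the ray representation are assumptions of this sketch.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def _dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def triangulate_midpoint(p1: Vec3, d1: Vec3, p2: Vec3, d2: Vec3) -> Vec3:
    """Midpoint of the shortest segment between rays p1 + s*d1 and p2 + t*d2."""
    w = tuple(a - b for a, b in zip(p1, p2))
    a, b, c = _dot(d1, d1), _dot(d1, d2), _dot(d2, d2)
    d, e = _dot(d1, w), _dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("rays are parallel; no unique closest point")
    # s and t minimize the squared distance |w + s*d1 - t*d2|^2.
    s = (b * e - c * d) / denom
    t = (a * e - b * d) / denom
    q1 = tuple(p + s * k for p, k in zip(p1, d1))
    q2 = tuple(p + t * k for p, k in zip(p2, d2))
    return tuple((u + v) / 2 for u, v in zip(q1, q2))
```

Overlapping fields of view matter here because a single ray fixes only a direction; a second camera seeing the same element is what makes the depth recoverable.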

The system of claim 1, wherein the image recognition engines comprise convolutional neural networks.

The system of claim 1, wherein the image recognition engines process the images to generate confidence arrays for elements of the image, wherein the confidence array for a particular element of an image includes confidence values for a plurality of joint types for the particular element, and select a joint type for the joint data structure of the particular element based on the confidence array.
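A simple reading of selecting a joint type "based on the confidence array" is to pick the type with the highest confidence value. The sketch below assumes that reading; the joint type names are hypothetical placeholders, since the actual set of joint types is not listed in this section.

```python
from typing import List, Sequence

# Hypothetical joint types, used only for illustration.
JOINT_TYPES: List[str] = ["left_wrist", "right_wrist", "neck", "not_a_joint"]


def select_joint_type(confidences: Sequence[float]) -> str:
    """Return the joint type with the highest confidence value."""
    if len(confidences) != len(JOINT_TYPES):
        raise ValueError("one confidence value is expected per joint type")
    best = max(range(len(confidences)), key=confidences.__getitem__)
    return JOINT_TYPES[best]
```

Other selection rules, such as requiring a minimum confidence before accepting a type, would fit the claim language equally well.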

The system of claim 1, wherein the logic to identify sets of candidate joints comprises heuristic functions based on physical relationships among the joints of subjects in the real space to identify sets of candidate joints as multi-joint subjects.

The system of claim 4, including logic to store the sets of joints identified as multi-joint subjects, wherein the logic to identify sets of candidate joints includes logic to determine whether a candidate joint identified in images taken at a particular time corresponds with a member of one of the sets of candidate joints identified as multi-joint subjects in preceding images.
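The correspondence test in the preceding claim could, for instance, be a distance check against the joints stored for each previously identified subject. The sketch below assumes that approach; the function name, the data layout, and the 0.3 threshold (in whatever unit the real-space coordinates use) are illustrative assumptions rather than details from the claims.

```python
import math
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float, float]


def match_to_subject(
    candidate: Point,
    subjects: Dict[str, List[Point]],
    max_distance: float = 0.3,  # assumed threshold, for illustration only
) -> Optional[str]:
    """Return the id of the stored subject whose nearest joint lies within
    max_distance of the candidate, or None if no subject is close enough."""
    best_id, best_dist = None, max_distance
    for subject_id, joints in subjects.items():
        for joint in joints:
            dist = math.dist(candidate, joint)
            if dist <= best_dist:
                best_id, best_dist = subject_id, dist
    return best_id
```

A candidate joint that matches no stored subject would be left for the logic that identifies new sets of candidate joints.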

The system of claim 1, wherein cameras in the plurality of cameras are configured to generate synchronized sequences of images.

The system of claim 1, wherein the plurality of cameras comprises cameras disposed overhead and having fields of view encompassing respective parts of the area in real space, and the coordinates in the real space of members of a set of candidate joints identified as a multi-joint subject identify locations of the multi-joint subject in the area.

The system of claim 1, including logic to track locations of a plurality of multi-joint subjects in the area of real space.

The system of claim 8, including logic to determine when multi-joint subjects in the plurality of multi-joint subjects leave the area of real space.

The system of claim 1, including logic to track locations in the area of real space of multiple candidate joints that are members of a set of candidate joints identified as a particular multi-joint subject.

A method for tracking multi-joint subjects in an area of real space, comprising: using a plurality of cameras to produce respective sequences of images of corresponding fields of view in the real space, the field of view of each camera overlapping with the field of view of at least one other camera in the plurality of cameras; processing images in the sequences of images to generate corresponding arrays of joint data structures, the arrays of joint data structures corresponding to particular images classifying elements of the particular images by joint type, time of the particular image, and coordinates of the element in the particular image; translating the coordinates of the elements in the arrays of joint data structures corresponding to images in different sequences into candidate joints having coordinates in the real space; and identifying sets of candidate joints having coordinates in the real space as multi-joint subjects in the real space.

The method of claim 11, wherein processing the images includes using convolutional neural networks.

The method of claim 11, wherein processing the images includes generating confidence arrays for elements of the image, wherein the confidence array for a particular element of an image includes confidence values for a plurality of joint types for the particular element, and selecting a joint type for the joint data structure of the particular element based on the confidence array.

The method of claim 11, wherein identifying sets of candidate joints comprises applying heuristic functions based on physical relationships among the joints of subjects in the real space to identify sets of candidate joints as multi-joint subjects.

The method of claim 14, including storing the sets of joints identified as multi-joint subjects, wherein identifying sets of candidate joints includes determining whether a candidate joint identified in images taken at a particular time corresponds with a member of one of the sets of candidate joints identified as multi-joint subjects in preceding images.

The method of claim 11, wherein the sequences of images are synchronized.

The method of claim 11, wherein the plurality of cameras comprises cameras disposed overhead and having fields of view encompassing respective parts of the area in real space, and the coordinates in the real space of members of a set of candidate joints identified as a multi-joint subject identify locations of the multi-joint subject in the area.

The method of claim 11, including tracking locations of a plurality of multi-joint subjects in the area of real space.

The method of claim 18, including determining when multi-joint subjects in the plurality of multi-joint subjects leave the area of real space.

The method of claim 11, including tracking locations in the area of real space of multiple candidate joints that are members of a set of candidate joints identified as a particular multi-joint subject.

A computer program product, comprising: a computer-readable memory comprising a non-transitory data storage medium; and computer instructions stored in the memory and executable by a computer to track multi-joint subjects in an area of real space by a process including: using sequences of images from a plurality of cameras having corresponding fields of view in the real space, the field of view of each camera overlapping with the field of view of at least one other camera in the plurality of cameras; processing images in the sequences of images to generate corresponding arrays of joint data structures, the arrays of joint data structures corresponding to particular images classifying elements of the particular images by joint type, time of the particular image, and coordinates of the element in the particular image; translating the coordinates of the elements in the arrays of joint data structures corresponding to images in different sequences into candidate joints having coordinates in the real space; and identifying sets of candidate joints having coordinates in the real space as multi-joint subjects in the real space.

The product of claim 21, wherein processing the images includes using convolutional neural networks.

The product of claim 21, wherein processing the images includes generating confidence arrays for elements of the image, wherein the confidence array for a particular element of an image includes confidence values for a plurality of joint types for the particular element, and selecting a joint type for the joint data structure of the particular element based on the confidence array.

The product of claim 21, wherein identifying sets of candidate joints comprises applying heuristic functions based on physical relationships among the joints of subjects in the real space to identify sets of candidate joints as multi-joint subjects.

The product of claim 24, including storing the sets of joints identified as multi-joint subjects, wherein identifying sets of candidate joints includes determining whether a candidate joint identified in images taken at a particular time corresponds with a member of one of the sets of candidate joints identified as multi-joint subjects in preceding images.

The product of claim 21, wherein the sequences of images are synchronized.

The product of claim 21, wherein the plurality of cameras comprises cameras disposed overhead and having fields of view encompassing respective parts of the area in real space, and the coordinates in the real space of members of a set of candidate joints identified as a multi-joint subject identify locations of the multi-joint subject in the area.

The product of claim 21, including tracking locations of a plurality of multi-joint subjects in the area of real space.

The product of claim 28, including determining when multi-joint subjects in the plurality of multi-joint subjects leave the area of real space.

The product of claim 21, including tracking locations in the area of real space of multiple candidate joints that are members of a set of candidate joints identified as a particular multi-joint subject.

A system for tracking puts and takes of inventory items by subjects in an area of real space, comprising: a plurality of cameras, cameras in the plurality of cameras producing respective sequences of images of corresponding fields of view in the real space, the field of view of each camera overlapping with the field of view of at least one other camera in the plurality of cameras; a processing system coupled to the plurality of cameras, the processing system including a plurality of image recognition engines receiving corresponding sequences of images from the plurality of cameras, the image recognition engines in the plurality of image recognition engines processing the images in the corresponding sequences to identify subjects represented in the images; and logic to process sets of images in the sequences of images that include the identified subjects to detect takes of inventory items by the identified subjects and puts of inventory items on a shelf by the identified subjects.

The system of claim 31, wherein the logic to process sets of images includes, for the identified subjects, logic to process the images to generate classifications of the images of the identified subjects, the classifications including: whether the identified subject is holding an inventory item, a first proximity classification indicating the location of a hand of the identified subject relative to a shelf, a second proximity classification indicating the location of a hand of the identified subject relative to the body of the identified subject, a third proximity classification indicating the location of a hand of the identified subject relative to a basket associated with the identified subject, and an identifier of a likely inventory item.

The system of claim 32, including logic to perform time sequence analysis over the classifications of the images to detect said takes and said puts by the identified subjects.

The system of claim 31, wherein the logic to process sets of images includes, for the identified subjects, logic to identify bounding boxes of data representing hands in the images in the sets of images of the identified subjects, and to process the data in the bounding boxes to generate classifications of the data within the bounding boxes for the identified subjects.

The system of claim 34, wherein the classifications include: whether the identified subject is holding an inventory item, a first proximity classification indicating the location of a hand of the identified subject relative to a shelf, a second proximity classification indicating the location of a hand of the identified subject relative to the body of the identified subject, a third proximity classification indicating the location of a hand of the identified subject relative to a basket associated with the identified subject, and an identifier of a likely inventory item.

The system of claim 34, including logic to perform time sequence analysis over the classifications of the data within the bounding boxes in the sets of images to detect said takes and said puts by the identified subjects.

The system of claim 31, including circular buffers coupled to cameras in the plurality of cameras to store sets of images in the sequences of images from the plurality of cameras.
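A circular buffer of the kind recited in the preceding claim keeps only the most recent frames from a camera, discarding the oldest when full. The following sketch shows one way to build it on Python's `collections.deque`; the class name, the `(frame number, image bytes)` layout, and the method names are assumptions made for illustration.

```python
from collections import deque
from typing import Deque, List, Tuple

Frame = Tuple[int, bytes]  # (frame number, encoded image); hypothetical layout


class FrameBuffer:
    """Fixed-capacity circular buffer holding the newest frames of one camera."""

    def __init__(self, capacity: int) -> None:
        self._frames: Deque[Frame] = deque(maxlen=capacity)

    def push(self, frame_number: int, image: bytes) -> None:
        # When the buffer is full, deque drops the oldest frame automatically.
        self._frames.append((frame_number, image))

    def latest(self, count: int) -> List[Frame]:
        """Return up to `count` of the most recent frames, oldest first."""
        if count <= 0:
            return []
        return list(self._frames)[-count:]
```

Bounding the buffer this way gives downstream time series analysis a fixed-length window of recent images without unbounded memory growth.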

The system of claim 31, wherein the logic to process sets of images comprises convolutional neural networks.

The system of claim 31, wherein cameras in the plurality of cameras are configured to generate synchronized sequences of images.

The system of claim 31, wherein the plurality of cameras comprises cameras disposed overhead and having fields of view encompassing respective parts of the area in real space.

The system of claim 31, including logic responsive to the detected takes and puts to generate a log data structure including a list of inventory items for each identified subject.
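The log data structure in the preceding claim can be pictured as a per-subject tally that grows on each take and shrinks on each put. The sketch below is one possible shape for it; the class name, the method names, and the choice to ignore a put of an item the subject was never seen taking are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from typing import DefaultDict, Dict


class CartLog:
    """Per-subject list of inventory items, updated by take and put events."""

    def __init__(self) -> None:
        self._carts: DefaultDict[str, Counter] = defaultdict(Counter)

    def record(self, subject_id: str, action: str, item_id: str) -> None:
        cart = self._carts[subject_id]
        if action == "take":
            cart[item_id] += 1
        elif action == "put":
            # Decrement, dropping the entry at zero; a put of an item that
            # was never taken is ignored (an assumption of this sketch).
            if cart[item_id] > 1:
                cart[item_id] -= 1
            else:
                cart.pop(item_id, None)
        else:
            raise ValueError(f"unknown action: {action}")

    def items(self, subject_id: str) -> Dict[str, int]:
        """Return a copy of the subject's item counts."""
        return dict(self._carts.get(subject_id, {}))
```

For example, two takes of one item followed by a take and a put of another leave only the first item, with a count of two, in that subject's list.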

A method for tracking a subject dropping and removing inventory items in an area of real space, the method comprising: using a plurality of cameras to generate respective sequences of images of corresponding viewing fields in the real space, the viewing fields of each camera being Overlaps with the viewing field of at least one other camera of the plurality of cameras; receives a corresponding sequence of images from the plurality of cameras, and uses an image recognition engine in the plurality of image recognition engines to process the images in the corresponding sequences And identifying the subjects represented in the images, wherein the plurality of image recognition engines are part of a processing system coupled to the plurality of cameras; and processing the images in the sequences that include the images of the identified subjects Collect to detect that the inventory item is taken by the identified subject and that the inventory item is dropped by the identified subject on the shelf.

The method of claim 42, wherein processing the sets of images includes: generating, for the identified subjects, categories of the images of the identified subjects, the categories including: whether the identified subject is holding an inventory item; a first proximity category indicating the position of the identified subject's hand relative to a shelf; a second proximity category indicating the position of the identified subject's hand relative to the body of the identified subject; a third proximity category indicating the position of the identified subject's hand relative to a basket associated with the identified subject; and an identifier of a likely inventory item.

The method of claim 43, including performing time series analysis of the categories across the sets of images to detect the takes and the drops by the identified subjects.
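The claims do not say how the time series analysis is carried out. The following is a minimal sketch under stated assumptions: the per-frame "holding" flag is smoothed with a sliding majority vote, and a change in the smoothed flag near a shelf is reported as a take or a drop. The names `HandObservation`, `smooth`, and `detect_events`, the window size `k`, and the sample data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class HandObservation:
    holding: bool     # is the hand holding an inventory item?
    near_shelf: bool  # simplified form of the first proximity category
    item_id: str      # identifier of the likely item ("" when none)


def smooth(flags, k=3):
    """Majority vote over a sliding window of up to k frames to suppress flicker."""
    out = []
    for i in range(len(flags)):
        window = flags[max(0, i - k // 2): i + k // 2 + 1]
        out.append(sum(window) * 2 > len(window))
    return out


def _nearby_item(observations, i):
    # First non-empty item identifier in the frames around index i.
    for obs in observations[max(0, i - 1): i + 2]:
        if obs.item_id:
            return obs.item_id
    return ""


def detect_events(observations, k=3):
    """Return (event, frame_index, item_id) for each smoothed holding transition near a shelf."""
    holding = smooth([o.holding for o in observations], k)
    events = []
    for i in range(1, len(observations)):
        if holding[i] == holding[i - 1]:
            continue
        if not (observations[i].near_shelf or observations[i - 1].near_shelf):
            continue  # transitions away from the shelf are ignored in this sketch
        kind = "take" if holding[i] else "drop"
        events.append((kind, i, _nearby_item(observations, i)))
    return events


holding_flags = [0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0]  # frame 5 is a one-frame dropout
near_flags = [0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0]
observations = [
    HandObservation(bool(h), bool(n), "soda" if h else "")
    for h, n in zip(holding_flags, near_flags)
]
events = detect_events(observations)
```

The single-frame dropout at frame 5 is absorbed by the majority vote, so the sample sequence yields one take followed by one drop rather than two of each.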

The method of claim 42, wherein processing the sets of images includes: identifying, for the identified subjects, bounding boxes of data representing hands in the images in the sets of images of the identified subjects, and processing the data in the bounding boxes to generate categories of the data in the bounding boxes for the identified subjects.

The method of claim 45, wherein the categories include: whether the identified subject is holding an inventory item; a first proximity category indicating the position of the identified subject's hand relative to a shelf; a second proximity category indicating the position of the identified subject's hand relative to the body of the identified subject; a third proximity category indicating the position of the identified subject's hand relative to a basket associated with the identified subject; and an identifier of a likely inventory item.

The method of claim 45, including performing time series analysis of the categories of the data in the bounding boxes across the sets of images to detect the takes and the drops by the identified subjects.

The method of claim 42, including a circular buffer coupled to the cameras in the plurality of cameras to store the sets of images from the sequences of images from the plurality of cameras.

The method of claim 42, including using a convolutional neural network to process the sets of images.

The method of claim 42, wherein the cameras in the plurality of cameras are configured to generate synchronized sequences of images.

The method of claim 42, wherein the plurality of cameras includes cameras disposed overhead and having viewing fields covering respective parts of the area in real space.

The method of claim 42, including generating, in response to the detected takes and drops, a log data structure that includes a list of inventory items for each identified subject.
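One possible shape for such a log data structure is a per-subject tally of items, incremented on a take and decremented on a drop. This is a sketch only; the class `ShoppingLog`, its method names, and the event strings `"take"` and `"drop"` are assumptions for illustration.

```python
from collections import defaultdict


class ShoppingLog:
    """Per-subject list of inventory items, updated from detected takes and drops."""

    def __init__(self):
        self._items = defaultdict(dict)  # subject_id -> {item_id: count}

    def record(self, subject_id, event, item_id):
        counts = self._items[subject_id]
        if event == "take":
            counts[item_id] = counts.get(item_id, 0) + 1
        elif event == "drop":
            # A drop returns one unit to the shelf; unknown items are ignored.
            if counts.get(item_id, 0) > 0:
                counts[item_id] -= 1
                if counts[item_id] == 0:
                    del counts[item_id]
        else:
            raise ValueError(f"unknown event: {event!r}")

    def items(self, subject_id):
        # Copy so callers cannot mutate the log by accident.
        return dict(self._items.get(subject_id, {}))


log = ShoppingLog()
log.record("subject-1", "take", "soda")
log.record("subject-1", "take", "chips")
log.record("subject-1", "drop", "soda")
```

In this example the subject ends up with a single bag of chips, because the soda was taken and then put back.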

A system for tracking the dropping and taking of inventory items by subjects in an area of real space, comprising: a plurality of cameras, the cameras in the plurality of cameras generating respective sequences of images of corresponding viewing fields in the real space, the viewing field of each camera overlapping with the viewing field of at least one other camera of the plurality of cameras; and a processing system coupled to the plurality of cameras, the processing system including: a first image recognition engine that receives the sequences of images from the plurality of cameras and processes the images to generate first data sets that identify subjects and the locations of the identified subjects in the real space; logic to process the first data sets to specify bounding boxes that include images of the hands of the identified subjects in the images in the sequences of images; a second image recognition engine that receives the sequences of images from the plurality of cameras and processes the specified bounding boxes in the images to generate classifications of the hands of the identified subjects, the classifications including: whether the identified subject is holding an inventory item; a first proximity category indicating the position of the identified subject's hand relative to a shelf; a second proximity category indicating the position of the identified subject's hand relative to the body of the identified subject; a third proximity category indicating the position of the identified subject's hand relative to a basket associated with the identified subject; and an identifier of a likely inventory item; and logic to process the classifications of the hands across sets of images in the sequences of images of the identified subjects to detect takes of inventory items by the identified subjects and drops of inventory items on a shelf by the identified subjects.

The system of claim 53, including a circular buffer coupled to the cameras in the plurality of cameras to store the sets of images from the sequences of images from the plurality of cameras.

The system of claim 53, wherein the first data sets include, for each identified subject, a set of candidate joints having coordinates in real space.

The system of claim 53, wherein the logic to process the first data sets to specify the bounding boxes specifies the bounding boxes based on the positions of joints in the sets of candidate joints for each subject.
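A minimal sketch of specifying a hand bounding box from joint positions follows. It assumes the candidate joints have already been projected into one camera's image plane as pixel coordinates, and that the box is a fixed-size square centered on the wrist joint and clipped to the image. The function name, the joint key `right_wrist`, the `half_size` of 32 pixels, and the image size are all assumptions, not details from the claims.

```python
def hand_bounding_box(joints, hand="right", half_size=32, image_size=(1280, 720)):
    """Return (left, top, right, bottom) for a square box around the wrist joint.

    `joints` maps joint names to (x, y) pixel coordinates (assumed already
    projected into this camera's image).
    """
    x, y = joints[f"{hand}_wrist"]
    width, height = image_size
    left = max(0, x - half_size)
    top = max(0, y - half_size)
    right = min(width, x + half_size)
    bottom = min(height, y + half_size)
    return (left, top, right, bottom)


box = hand_bounding_box({"right_wrist": (100, 20)})
```

Clipping keeps the box inside the image when the wrist is near an edge, as in this example where the top edge is clamped to zero.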

The system of claim 53, wherein the second image recognition engine includes a convolutional neural network.

The system of claim 53, wherein the logic to process the classifications for the bounding boxes includes a convolutional neural network.

The system of claim 53, wherein the cameras in the plurality of cameras are configured to generate synchronized sequences of images.

The system of claim 53, wherein the plurality of cameras includes cameras disposed overhead and having viewing fields covering respective parts of the area in real space.

The system of claim 53, including logic to generate a log data structure that includes a list of inventory items for each identified subject.

A method for tracking the dropping and taking of inventory items by subjects in an area of real space, comprising: using a plurality of cameras to generate respective sequences of images of corresponding viewing fields in the real space, the viewing field of each camera overlapping with the viewing field of at least one other camera of the plurality of cameras; receiving the sequences of images from the plurality of cameras, and using a first image recognition engine to process the images to generate first data sets that identify subjects and the locations of the identified subjects in the real space; processing the first data sets to specify bounding boxes that include images of the hands of the identified subjects in the images in the sequences of images; receiving the sequences of images from the plurality of cameras, and using a second image recognition engine to process the specified bounding boxes in the images to generate classifications of the hands of the identified subjects, the classifications including: whether the identified subject is holding an inventory item; a first proximity category indicating the position of the identified subject's hand relative to a shelf; a second proximity category indicating the position of the identified subject's hand relative to the body of the identified subject; a third proximity category indicating the position of the identified subject's hand relative to a basket associated with the identified subject; and an identifier of a likely inventory item; and processing the classifications of the hands across sets of images in the sequences of images of the identified subjects to detect takes of inventory items by the identified subjects and drops of inventory items on a shelf by the identified subjects.

The method of claim 62, including a circular buffer coupled to the cameras in the plurality of cameras to store the sets of images from the sequences of images from the plurality of cameras.

The method of claim 62, wherein the first data sets include, for each identified subject, a set of candidate joints having coordinates in real space.

The method of claim 62, wherein processing the first data sets to specify the bounding boxes includes specifying the bounding boxes based on the positions of joints in the sets of candidate joints for each subject.

The method of claim 62, wherein the second image recognition engine includes a convolutional neural network.

The method of claim 62, including using a convolutional neural network to process the classifications for the bounding boxes.

The method of claim 62, wherein the cameras in the plurality of cameras are configured to generate synchronized sequences of images.

The method of claim 62, wherein the plurality of cameras includes cameras disposed overhead and having viewing fields covering respective parts of the area in real space.

The method of claim 62, including generating a log data structure that includes a list of inventory items for each identified subject.

A computer program product comprising: a computer-readable memory comprising a non-transitory data storage medium; and computer instructions stored in the memory, executable by a computer to track the dropping and taking of inventory items by subjects in an area of real space by a process including: using a plurality of cameras to generate respective sequences of images of corresponding viewing fields in the real space, the viewing field of each camera overlapping with the viewing field of at least one other camera of the plurality of cameras; receiving the corresponding sequences of images from the plurality of cameras, and using image recognition engines of a plurality of image recognition engines to process the images in the corresponding sequences and identify the subjects represented in the images, wherein the plurality of image recognition engines is part of a processing system coupled to the plurality of cameras; and processing sets of images in the sequences that include images of the identified subjects to detect takes of inventory items by the identified subjects and drops of inventory items on a shelf by the identified subjects.

The product of claim 71, wherein processing the sets of images includes: generating, for the identified subjects, categories of the images of the identified subjects, the categories including: whether the identified subject is holding an inventory item; a first proximity category indicating the position of the identified subject's hand relative to a shelf; a second proximity category indicating the position of the identified subject's hand relative to the body of the identified subject; a third proximity category indicating the position of the identified subject's hand relative to a basket associated with the identified subject; and an identifier of a likely inventory item.

The product of claim 72, including performing time series analysis of the categories across the sets of images to detect the takes and the drops by the identified subjects.

The product of claim 71, wherein processing the sets of images includes: identifying, for the identified subjects, bounding boxes of data representing hands in the images in the sets of images of the identified subjects, and processing the data in the bounding boxes to generate categories of the data in the bounding boxes for the identified subjects.

The product of claim 74, wherein the categories include: whether the identified subject is holding an inventory item; a first proximity category indicating the position of the identified subject's hand relative to a shelf; a second proximity category indicating the position of the identified subject's hand relative to the body of the identified subject; a third proximity category indicating the position of the identified subject's hand relative to a basket associated with the identified subject; and an identifier of a likely inventory item.

The product of claim 74, including performing time series analysis of the categories of the data in the bounding boxes across the sets of images to detect the takes and the drops by the identified subjects.

The product of claim 71, including a circular buffer coupled to the cameras in the plurality of cameras to store the sets of images from the sequences of images from the plurality of cameras.

The product of claim 71, including using a convolutional neural network to process the sets of images.

The product of claim 71, wherein the cameras in the plurality of cameras are configured to generate synchronized sequences of images.

The product of claim 71, wherein the plurality of cameras includes cameras disposed overhead and having viewing fields covering respective parts of the area in real space.

The product of claim 71, including generating, in response to the detected takes and drops, a log data structure that includes a list of inventory items for each identified subject.

A computer program product comprising: a computer-readable memory comprising a non-transitory data storage medium; and computer instructions stored in the memory, executable by a computer to track the dropping and taking of inventory items by subjects in an area of real space by a process including: using a plurality of cameras to generate respective sequences of images of corresponding viewing fields in the real space, the viewing field of each camera overlapping with the viewing field of at least one other camera of the plurality of cameras; receiving the sequences of images from the plurality of cameras, and using a first image recognition engine to process the images to generate first data sets that identify subjects and the locations of the identified subjects in the real space; processing the first data sets to specify bounding boxes that include images of the hands of the identified subjects in the images in the sequences of images; receiving the sequences of images from the plurality of cameras, and using a second image recognition engine to process the specified bounding boxes in the images to generate classifications of the hands of the identified subjects, the classifications including: whether the identified subject is holding an inventory item; a first proximity category indicating the position of the identified subject's hand relative to a shelf; a second proximity category indicating the position of the identified subject's hand relative to the body of the identified subject; a third proximity category indicating the position of the identified subject's hand relative to a basket associated with the identified subject; and an identifier of a likely inventory item; and processing the classifications of the hands across sets of images in the sequences of images of the identified subjects to detect takes of inventory items by the identified subjects and drops of inventory items on a shelf by the identified subjects.

The product of claim 82, including a circular buffer coupled to the cameras in the plurality of cameras to store the sets of images from the sequences of images from the plurality of cameras.

The product of claim 82, wherein the first data sets include, for each identified subject, a set of candidate joints having coordinates in real space.

The product of claim 82, wherein processing the first data sets to specify the bounding boxes includes specifying the bounding boxes based on the positions of joints in the sets of candidate joints for each subject.

The product of claim 82, wherein the second image recognition engine includes a convolutional neural network.

The product of claim 82, including using a convolutional neural network to process the classifications for the bounding boxes.

The product of claim 82, wherein the cameras in the plurality of cameras are configured to generate synchronized sequences of images.

The product of claim 82, wherein the plurality of cameras includes cameras disposed overhead and having viewing fields covering respective parts of the area in real space.

The product of claim 82, including generating a log data structure that includes a list of inventory items for each identified subject.

A system comprising: a camera that generates sequences of images including a hand; and a processing system coupled to the camera, the processing system including a hand image recognition engine that receives the sequences of images and generates classifications of the hand in time sequence, and logic to process the classifications of the hand from the sequences of images to identify actions by a subject.

The system of claim 91, wherein the actions are the dropping and the taking of inventory items.

The system of claim 91, including logic to identify the positions of joints of the subject in the images in the sequences of images, and to identify, based on the identified joints, bounding boxes in the corresponding images that include the subject's hands.

A system for tracking changes in an area of real space, comprising: a plurality of cameras, the cameras in the plurality of cameras generating respective sequences of images of corresponding viewing fields in the real space, the viewing field of each camera overlapping with the viewing field of at least one other camera of the plurality of cameras; and a processing system coupled to the plurality of cameras, the processing system including: first image processors, including subject image recognition engines, that receive the corresponding sequences of images from the plurality of cameras and process the images to identify the subjects represented in the images in the corresponding sequences of images; and second image processors, including background image recognition engines, that receive the corresponding sequences of images from the plurality of cameras, mask the identified subjects to generate masked images, and process the masked images to identify and classify the background changes represented in the images in the corresponding sequences of images.

The system of claim 94, wherein the background image recognition engines include convolutional neural networks.

The system of claim 94, including logic to associate an identified background change with an identified subject.

The system of claim 94, wherein the second image processors include: background image storage to store background images for the corresponding sequences of images; and mask logic to process the images in the sequences of images, replacing foreground image data representing the identified subjects with background image data from the background images for the corresponding sequences of images, to provide the masked images.
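The masking step can be illustrated with a small sketch. It assumes grayscale images stored as nested lists and a per-pixel boolean mask marking where identified subjects appear; the function name `mask_subjects` and the sample values are hypothetical.

```python
def mask_subjects(frame, background, subject_mask):
    """Replace pixels covered by identified subjects with stored background pixels.

    All three arguments are equally sized 2-D lists; `subject_mask` holds a
    truthy value wherever a subject (foreground) was identified.
    """
    return [
        [bg if covered else px for px, bg, covered in zip(f_row, b_row, m_row)]
        for f_row, b_row, m_row in zip(frame, background, subject_mask)
    ]


background = [[1, 2, 3],
              [4, 5, 6]]
frame = [[9, 9, 3],
         [4, 9, 7]]          # 9s are a subject; the 7 is a real shelf change
subject_mask = [[1, 1, 0],
                [0, 1, 0]]
masked = mask_subjects(frame, background, subject_mask)
```

Pixels belonging to the subject fall back to the stored background, while the changed shelf pixel (value 7) survives into the masked image, where a later stage could classify it.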

The system of claim 97, wherein the mask logic combines sets of N masked images in the sequences of images to generate a sequence of factorized images for each camera, and the second image processors process the sequence of factorized images to identify and classify background changes.
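The claim states only that N masked images are combined; it does not say how. One plausible combination, shown here purely as an assumption, is a per-pixel average over the N images, which damps transient noise. The function name `combine_masked_images` is hypothetical.

```python
def combine_masked_images(masked_images):
    """Average N equally sized 2-D masked images pixel by pixel (assumed combination rule)."""
    n = len(masked_images)
    if n == 0:
        raise ValueError("need at least one masked image")
    rows = len(masked_images[0])
    cols = len(masked_images[0][0])
    return [
        [sum(img[r][c] for img in masked_images) / n for c in range(cols)]
        for r in range(rows)
    ]


combined = combine_masked_images([
    [[0, 2], [4, 4]],
    [[2, 4], [4, 8]],
])
```

Other combination rules, such as a per-pixel median, would fit the claim language equally well; the average is used here only because it is the simplest to show.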

The system of claim 94, wherein the second image processors include: logic to generate change data structures for the corresponding sequences of images, the change data structures including the coordinates in the masked images of the identified background changes, identifiers of the inventory items that are the subjects of the identified background changes, and categories of the identified background changes; and coordination logic to process the change data structures from sets of cameras having overlapping viewing fields to locate the identified background changes in real space.
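A change data structure with the three claimed fields (coordinates, item identifier, category) might look like the sketch below. The names `BackgroundChange`, `ChangeKind`, and `corroborated` are assumptions. The helper is a deliberately simplified stand-in for the coordination logic: it only checks that at least two cameras report the same item and category, and does not compute a real-space location, which would need camera calibration data not described here.

```python
from dataclasses import dataclass
from enum import Enum


class ChangeKind(Enum):
    ADDED = "added"      # item appeared relative to the background image
    REMOVED = "removed"  # item disappeared relative to the background image


@dataclass(frozen=True)
class BackgroundChange:
    camera_id: str
    x: int               # pixel coordinates of the change in the masked image
    y: int
    item_id: str         # identifier of the inventory item involved
    kind: ChangeKind     # category of the change


def corroborated(changes, min_cameras=2):
    """Return the (item_id, kind) pairs reported by at least `min_cameras` distinct cameras."""
    cameras = {}
    for change in changes:
        cameras.setdefault((change.item_id, change.kind), set()).add(change.camera_id)
    return {key for key, cams in cameras.items() if len(cams) >= min_cameras}


changes = [
    BackgroundChange("cam0", 120, 88, "soda", ChangeKind.REMOVED),
    BackgroundChange("cam1", 310, 95, "soda", ChangeKind.REMOVED),
    BackgroundChange("cam1", 40, 200, "chips", ChangeKind.ADDED),
]
confirmed = corroborated(changes)
```

Only the soda removal is seen by two cameras in this sample, so it is the only change that passes the agreement check.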

The system of claim 94, wherein the categories of the identified background changes in the change data structures indicate whether the identified inventory item has been added or removed relative to the background image.

The system of claim 99, wherein the categories of the identified background changes in the change data structures indicate whether the identified inventory item has been added or removed relative to the background image; and including logic to associate the background changes with identified subjects, and to detect takes of inventory items by the identified subjects and drops of inventory items on a shelf by the identified subjects.

The system of claim 94, including logic to associate background changes with identified subjects, and to detect takes of inventory items by the identified subjects and drops of inventory items on a shelf by the identified subjects.

The system of claim 94, wherein the first image processors identify the positions of the hands of the identified subjects; and including logic to associate the background changes with identified subjects by comparing the positions of the changes with the positions of the hands of the identified subjects, and to detect takes of inventory items by the identified subjects and drops of inventory items on a shelf by the identified subjects.
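Comparing change positions with hand positions could be as simple as a nearest-hand search with a distance cutoff. The sketch below assumes both positions are 3-D points in the same real-space coordinate system; the function name `associate_change` and the 0.5 unit `max_distance` threshold are hypothetical.

```python
import math


def associate_change(change_position, hand_positions, max_distance=0.5):
    """Return the subject whose hand is closest to the change, or None if none is close enough.

    `hand_positions` maps subject_id -> (x, y, z) in real-space coordinates.
    """
    best_subject, best_distance = None, max_distance
    for subject_id, hand in hand_positions.items():
        distance = math.dist(change_position, hand)
        if distance <= best_distance:
            best_subject, best_distance = subject_id, distance
    return best_subject


hands = {"subject-1": (1.1, 2.0, 1.5), "subject-2": (3.0, 3.0, 1.0)}
owner = associate_change((1.0, 2.0, 1.5), hands)
```

A change with no hand inside the cutoff is left unassigned rather than attributed to the wrong subject.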

The system of claim 94, including third image processors that include foreground image recognition engines, which receive the corresponding sequences of images from the plurality of cameras and process the images to identify and classify the foreground changes represented in the images in the corresponding sequences of images.

The system of claim 104, including: logic to associate background changes with identified subjects, and to make a first set of detections of takes of inventory items by the identified subjects and of drops of inventory items on a shelf by the identified subjects; logic to associate foreground changes with identified subjects, and to make a second set of detections of takes of inventory items by the identified subjects and of drops of inventory items on a shelf by the identified subjects; and selection logic to process the first and second sets of detections to generate a log data structure that includes lists of inventory items for identified subjects.
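The claim does not describe how the selection logic chooses between the two sets of detections. As one hedged possibility, each detection could carry a confidence score, with the higher-scoring detection kept whenever both sets report on the same subject and event. The function name `select_detections`, the key layout, and the confidence field are all assumptions.

```python
def select_detections(first_set, second_set):
    """Merge two detection sets, keeping the more confident entry on conflicts.

    Each set maps (subject_id, event_index) -> (event, item_id, confidence).
    """
    chosen = {}
    for key in sorted(set(first_set) | set(second_set)):
        candidates = [s[key] for s in (first_set, second_set) if key in s]
        chosen[key] = max(candidates, key=lambda detection: detection[2])
    return chosen


background_detections = {
    ("subject-1", 0): ("take", "soda", 0.6),
    ("subject-1", 1): ("drop", "soda", 0.9),
}
foreground_detections = {
    ("subject-1", 0): ("take", "cola", 0.8),
    ("subject-2", 0): ("take", "chips", 0.7),
}
selected = select_detections(background_detections, foreground_detections)
```

The merged result could then feed the per-subject log; detections present in only one set pass through unchanged.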

The system of claim 94, wherein the sequences of images from the cameras in the plurality of cameras are synchronized.

A system for tracking the dropping and taking of inventory items by subjects in an area of real space, comprising: a plurality of cameras, the cameras in the plurality of cameras generating respective sequences of images of corresponding viewing fields in the real space, the viewing field of each camera overlapping with the viewing field of at least one other camera of the plurality of cameras; and a processing system coupled to the plurality of cameras, the processing system including: first image processors, including subject image recognition engines, that receive the corresponding sequences of images from the plurality of cameras and process the images to identify the subjects represented in the images in the corresponding sequences of images; second image processors, including background image recognition engines, that receive the corresponding sequences of images from the plurality of cameras, mask the identified subjects to generate masked images, process the masked images to identify and classify the background changes represented in the images in the corresponding sequences of images, and process the identified background changes to make a first set of detections of takes of inventory items by the identified subjects and of drops of inventory items on a shelf by the identified subjects; third image processors, including foreground image recognition engines, that receive the corresponding sequences of images from the plurality of cameras, process the images to identify and classify the foreground changes represented in the images in the corresponding sequences of images, and process the identified foreground changes to make a second set of detections of takes of inventory items by the identified subjects and of drops of inventory items on a shelf by the identified subjects; and selection logic to process the first and second sets of detections to generate a log data structure that includes lists of inventory items for identified subjects.

A system for tracking the dropping and taking of inventory items by subjects in an area of real space, comprising: a plurality of cameras, the cameras in the plurality of cameras generating respective sequences of images of corresponding viewing fields in the real space, the viewing field of each camera overlapping with the viewing field of at least one other camera of the plurality of cameras; and a processing system coupled to the plurality of cameras, the processing system including logic to detect the dropping and taking of inventory items by identifying semantically significant changes in inventory items on inventory display structures and associating the semantically significant changes with subjects represented in the sequences of images.

The system of claim 108, wherein the processing system includes logic to detect the dropping and taking of inventory items by identifying gestures of subjects, and inventory items associated with the gestures, represented in the sequences of images.

A system for tracking the dropping and taking of inventory items by subjects in an area of real space, comprising: a plurality of cameras, the cameras in the plurality of cameras generating respective sequences of images of corresponding viewing fields in the real space, the viewing field of each camera overlapping with the viewing field of at least one other camera of the plurality of cameras; and a processing system coupled to the plurality of cameras, the processing system including logic to detect the dropping and taking of inventory items by identifying gestures of subjects, and inventory items associated with the gestures, represented in the sequences of images.

A method for tracking changes in an area of real space, comprising: using a plurality of cameras to generate respective sequences of images of corresponding viewing fields in the real space, the viewing field of each camera overlapping with the viewing field of at least one other camera of the plurality of cameras; using first image processors, including subject image recognition engines, to process the images to identify the subjects represented in the images in the corresponding sequences of images; and using second image processors, including background image recognition engines, to mask the identified subjects in the images in the sequences of images to generate masked images, and to process the masked images to identify and classify the background changes represented in the images in the corresponding sequences of images.

The method of claim 111, wherein the background image recognition engines include convolutional neural networks.

The method of claim 111, including associating an identified background change with an identified subject.

The method of claim 111, wherein using the second image processors includes: storing background images for the corresponding sequences of images; and processing the images in the sequences of images, replacing foreground image data representing the identified subjects with background image data from the background images for the corresponding sequences of images, to provide the masked images.

The method of claim 114, wherein processing the images in the sequences of images includes combining sets of N masked images in the sequences of images to generate a sequence of factorized images for each camera, and the second image processors process the sequence of factorized images to identify and classify background changes.

The method of claim 111, wherein using the second image processors includes: generating change data structures for the corresponding sequences of images, the change data structures including the coordinates in the masked images of the identified background changes, identifiers of the inventory items that are the subjects of the identified background changes, and categories of the identified background changes; and processing the change data structures from sets of cameras having overlapping viewing fields to locate the identified background changes in real space.

The method of claim 116, wherein the categories of the identified background changes in the change data structures indicate whether the identified inventory item has been added or removed relative to the background image.

The method of claim 116, wherein the categories of the identified background changes in the change data structures indicate whether the identified inventory item has been added or removed relative to the background image; and including associating the background changes with identified subjects, and detecting takes of inventory items by the identified subjects and drops of inventory items on a shelf by the identified subjects.

The method of claim 111, including associating background changes with identified subjects, and detecting takes of inventory items by the identified subjects and drops of inventory items on a shelf by the identified subjects.

The method of claim 111, wherein using the first image processors includes identifying the positions of the hands of the identified subjects; and including associating the background changes with identified subjects by comparing the positions of the changes with the positions of the hands of the identified subjects, and detecting takes of inventory items by the identified subjects and drops of inventory items on a shelf by the identified subjects.

The method of claim 111, including using a third image processor, which includes a foreground image recognition engine, receives corresponding sequences of images from the plurality of cameras, and processes the images to identify and classify foreground changes represented in the images in the corresponding sequences of images.

The method of claim 121, including associating background changes with identified subjects, and making a first set of detections of takes of inventory items by the identified subjects and of puts of inventory items on a shelf by the identified subjects; associating foreground changes with identified subjects, and making a second set of detections of takes of inventory items by the identified subjects and of puts of inventory items on a shelf by the identified subjects; and processing the first and second sets of detections to generate log data structures including lists of inventory items for identified subjects.

The method of claim 111 includes synchronizing the sequences of images from the cameras in the plurality of cameras.

A method for tracking puts and takes of inventory items by subjects in an area of real space, including: using a plurality of cameras to generate respective sequences of images of corresponding fields of view in the real space, the field of view of each camera overlapping with the field of view of at least one other camera in the plurality of cameras; and detecting puts and takes of inventory items by identifying semantically significant changes in inventory items on inventory display structures and associating the semantically significant changes with subjects represented in the sequences of images.

The method of claim 124, including detecting puts and takes of inventory items by identifying gestures of subjects and inventory items associated with the gestures represented in the sequences of images.

A method for tracking puts and takes of inventory items by subjects in an area of real space, including: using a plurality of cameras to generate respective sequences of images of corresponding fields of view in the real space, the field of view of each camera overlapping with the field of view of at least one other camera in the plurality of cameras; and detecting puts and takes of inventory items by identifying gestures of subjects and inventory items associated with the gestures represented in the sequences of images.

A computer program product comprising: a computer readable memory, which includes a non-transitory data storage medium; and computer instructions stored in the memory, executable by a computer to track multi-joint subjects in an area of real space by a process including: using a plurality of cameras to generate respective sequences of images of corresponding fields of view in the real space, the field of view of each camera overlapping with the field of view of at least one other camera in the plurality of cameras; using first image processors, including subject image recognition engines, to process the images to identify subjects represented in the images in the corresponding sequences of images; and using second image processors, including background image recognition engines, to mask the identified subjects in the images in the sequences of images to generate masked images, and to process the masked images to identify and classify background changes represented in the images in the corresponding sequences of images.

The computer program product of claim 127, wherein the background image recognition engines include convolutional neural networks.

The computer program product of claim 127, the process including associating the identified background changes with the identified subjects.

The computer program product of claim 127, wherein using the second image processors includes storing background images for the corresponding sequences of images; and processing the images in the sequences of images to replace foreground image data representing the identified subjects with background image data from the background images for the corresponding sequences of images, to provide the masked images.
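The masking step recited above can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that each image is a grayscale grid of pixel values and that a per-pixel boolean mask marking the identified subjects is already available; the name `mask_subjects` and both representations are hypothetical and are not taken from the specification.

```python
from typing import List

Image = List[List[int]]   # grayscale image as rows of pixel values (assumption)
Mask = List[List[bool]]   # True where a pixel belongs to an identified subject


def mask_subjects(frame: Image, background: Image, subject_mask: Mask) -> Image:
    """Return a copy of `frame` in which subject pixels are replaced by
    the pixels at the same location in the stored background image."""
    return [
        [bg if is_subject else px
         for px, bg, is_subject in zip(frame_row, bg_row, mask_row)]
        for frame_row, bg_row, mask_row in zip(frame, background, subject_mask)
    ]


# A subject occupies the middle column of a 2x3 frame.
frame = [[10, 200, 10], [10, 200, 10]]
background = [[10, 12, 10], [10, 11, 10]]
subject_mask = [[False, True, False], [False, True, False]]
masked = mask_subjects(frame, background, subject_mask)
```

The input frame is left unmodified, so the same frame can still be passed to the subject image recognition engine.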

The computer program product of claim 127, wherein processing the images in the sequences of images includes combining sets of N masked images in the sequences of images to generate sequences of factorized images for each camera, and the second image processors identify and classify background changes by processing the sequences of factorized images.
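One way to combine sets of N masked images is sketched below. The claim does not state the combining operation; per-pixel averaging over consecutive, non-overlapping groups of N images is an assumption made here for illustration, and the name `factor_images` is hypothetical.

```python
from typing import List

Image = List[List[float]]  # grayscale image as rows of pixel values (assumption)


def factor_images(masked_images: List[Image], n: int) -> List[Image]:
    """Average each consecutive, non-overlapping group of n masked images.

    Averaging is an assumed combining operation. A trailing group with
    fewer than n images is dropped.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    factored = []
    for start in range(0, len(masked_images) - n + 1, n):
        group = masked_images[start:start + n]
        rows, cols = len(group[0]), len(group[0][0])
        factored.append([
            [sum(img[r][c] for img in group) / n for c in range(cols)]
            for r in range(rows)
        ])
    return factored


# Five 1x1 masked images combined in groups of two give two factorized images.
sequence = [[[0.0]], [[2.0]], [[4.0]], [[6.0]], [[9.0]]]
factored = factor_images(sequence, 2)
```

Averaging over several frames would tend to suppress transient pixel noise, which is one plausible reason to combine masked images before classifying background changes.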

The computer program product of claim 127, wherein using the second image processors includes generating change data structures for the corresponding sequences of images, the change data structures including coordinates in the masked images of the identified background changes, identifiers of the inventory items that are the subject of the identified background changes, and classifications of the identified background changes; and processing the change data structures from sets of cameras with overlapping fields of view to locate the identified background changes in real space.
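A change data structure with the three recited fields might look like the sketch below. The field names, the `camera_id` field, and the corroboration rule (keep a change only when at least two cameras report the same item and classification) are assumptions for illustration; locating a change in real space would additionally require camera calibration, which is omitted here.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class BackgroundChange:
    camera_id: str   # camera whose sequence produced the change (assumed field)
    x: int           # column of the change in the masked image
    y: int           # row of the change in the masked image
    item_id: str     # identifier of the inventory item (hypothetical SKU)
    category: str    # "added" or "removed" relative to the background image


def corroborated_changes(
    changes: Iterable[BackgroundChange], min_cameras: int = 2
) -> Dict[Tuple[str, str], List[BackgroundChange]]:
    """Group changes by (item, category) and keep the groups that are
    reported by at least `min_cameras` distinct cameras."""
    by_key: Dict[Tuple[str, str], List[BackgroundChange]] = defaultdict(list)
    for change in changes:
        by_key[(change.item_id, change.category)].append(change)
    return {
        key: group
        for key, group in by_key.items()
        if len({c.camera_id for c in group}) >= min_cameras
    }


changes = [
    BackgroundChange("cam1", 40, 12, "sku-1", "removed"),
    BackgroundChange("cam2", 77, 15, "sku-1", "removed"),
    BackgroundChange("cam1", 5, 5, "sku-2", "added"),
]
confirmed = corroborated_changes(changes)
```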

The computer program product of claim 132, wherein the classifications of the identified background changes in the change data structures indicate whether the identified inventory item has been added or removed relative to the background image.

The computer program product of claim 133, wherein the classifications of the identified background changes in the change data structures indicate whether the identified inventory item has been added or removed relative to the background image; and including associating the background changes with identified subjects, and making detections of takes of inventory items by the identified subjects and of puts of inventory items on a shelf by the identified subjects.

The computer program product of claim 127, including associating background changes with identified subjects, and making detections of takes of inventory items by the identified subjects and of puts of inventory items on a shelf by the identified subjects.

The computer program product of claim 127, wherein using the first image processors includes identifying locations of hands of the identified subjects; and including comparing the locations of the changes with the locations of the hands of the identified subjects to associate the background changes with identified subjects, and making detections of takes of inventory items by the identified subjects and of puts of inventory items on a shelf by the identified subjects.
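The comparison of change locations with hand locations can be sketched as a nearest-hand search. The claim does not specify the comparison; Euclidean distance with a cutoff is an assumption, and the function name, the coordinate units, and the `max_distance` parameter are hypothetical.

```python
import math
from typing import Dict, List, Optional, Sequence


def associate_change(
    change_pos: Sequence[float],
    hand_positions: Dict[str, List[Sequence[float]]],
    max_distance: float,
) -> Optional[str]:
    """Return the id of the subject whose hand is closest to the change,
    or None when no hand lies within `max_distance` (assumed cutoff)."""
    best_subject: Optional[str] = None
    best_dist = max_distance
    for subject_id, hands in hand_positions.items():
        for hand in hands:
            dist = math.dist(change_pos, hand)
            if dist <= best_dist:
                best_subject, best_dist = subject_id, dist
    return best_subject


# Two subjects with hand locations in real-space coordinates (assumed metres).
hands = {
    "subject-1": [(0.0, 0.0, 1.0), (0.5, 0.0, 1.0)],
    "subject-2": [(2.0, 2.0, 1.0)],
}
owner = associate_change((0.6, 0.0, 1.0), hands, max_distance=0.3)
```

A change with no hand nearby returns `None`, which leaves the change unassigned rather than forcing it onto a distant subject.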

The computer program product of claim 127, including using a third image processor, which includes a foreground image recognition engine, receives corresponding sequences of images from the plurality of cameras, and processes the images to identify and classify foreground changes represented in the images in the corresponding sequences of images.

The computer program product of claim 137, including associating background changes with identified subjects, and making a first set of detections of takes of inventory items by the identified subjects and of puts of inventory items on a shelf by the identified subjects; associating foreground changes with identified subjects, and making a second set of detections of takes of inventory items by the identified subjects and of puts of inventory items on a shelf by the identified subjects; and processing the first and second sets of detections to generate log data structures including lists of inventory items for identified subjects.
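Processing the two sets of detections into a log data structure could proceed as in the sketch below. The claim does not say how the sets are reconciled; matching detections by subject, item, and frame index and keeping the higher-confidence one is an assumption, and every name and field here is hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class Detection:
    subject_id: str
    item_id: str
    action: str        # "take" or "put"
    frame: int         # assumed frame index used to match the two sets
    confidence: float  # assumed score in [0, 1]


def build_log(
    background_set: Iterable[Detection], foreground_set: Iterable[Detection]
) -> Dict[str, Counter]:
    """Merge both sets, keep the higher-confidence detection per event,
    and return per-subject item counts (take adds one, put subtracts one)."""
    best: Dict[Tuple[str, str, int], Detection] = {}
    for det in list(background_set) + list(foreground_set):
        key = (det.subject_id, det.item_id, det.frame)
        if key not in best or det.confidence > best[key].confidence:
            best[key] = det
    log: Dict[str, Counter] = {}
    for det in sorted(best.values(), key=lambda d: d.frame):
        items = log.setdefault(det.subject_id, Counter())
        items[det.item_id] += 1 if det.action == "take" else -1
    return log


first_set = [
    Detection("s1", "cola", "take", 10, 0.6),
    Detection("s1", "chips", "take", 20, 0.9),
]
second_set = [
    Detection("s1", "cola", "take", 10, 0.8),   # same event seen by both sets
    Detection("s1", "chips", "put", 30, 0.7),   # chips returned to the shelf
]
log = build_log(first_set, second_set)
```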

The computer program product of claim 127, including synchronizing the sequences of images from the cameras in the plurality of cameras.

A computer program product comprising: a computer readable memory, which includes a non-transitory data storage medium; and computer instructions stored in the memory, executable by a computer to track puts and takes of inventory items by subjects in an area of real space by a process including: using a plurality of cameras to generate respective sequences of images of corresponding fields of view in the real space, the field of view of each camera overlapping with the field of view of at least one other camera in the plurality of cameras; and detecting puts and takes of inventory items by identifying semantically significant changes in inventory items on inventory display structures and associating the semantically significant changes with subjects represented in the sequences of images.

The computer program product of claim 140, including detecting puts and takes of inventory items by identifying gestures of subjects and inventory items associated with the gestures represented in the sequences of images.

A computer program product comprising: a computer readable memory, which includes a non-transitory data storage medium; and computer instructions stored in the memory, executable by a computer to track puts and takes of inventory items by subjects in an area of real space by a process including: using a plurality of cameras to generate respective sequences of images of corresponding fields of view in the real space, the field of view of each camera overlapping with the field of view of at least one other camera in the plurality of cameras; and detecting puts and takes of inventory items by identifying gestures of subjects and inventory items associated with the gestures represented in the sequences of images.

A system for tracking puts and takes of inventory items by subjects in an area of real space that includes inventory display structures, including: a plurality of cameras, arranged on the inventory display structures, the cameras in the plurality of cameras generating respective sequences of images of the inventory display structures in corresponding fields of view in the real space, the field of view of each camera overlapping with the field of view of at least one other camera in the plurality of cameras; and a processing system coupled to the plurality of cameras, the processing system including logic to detect puts and takes of inventory items by identifying gestures of subjects and inventory items associated with the gestures represented in the sequences of images.

The system of claim 143, wherein the logic to detect puts and takes of inventory items by identifying gestures of subjects and inventory items associated with the gestures includes foreground image recognition engines, which identify the gestures by processing foreground data in the sequences of images, and further including logic to detect puts and takes of inventory items by identifying semantically significant changes in inventory items on the inventory display structures, that logic including background image recognition engines, which identify the changes by processing background data in the sequences of images.

A system for tracking changes in an area of real space, including: a plurality of cameras, the cameras in the plurality of cameras generating respective sequences of images of corresponding fields of view in the real space, the field of view of each camera overlapping with the field of view of at least one other camera in the plurality of cameras; and a processing system coupled to the plurality of cameras, the processing system including: first image processors, including subject image recognition engines, which receive corresponding sequences of images from the plurality of cameras and process the images to identify subjects represented in the images in the corresponding sequences of images; second image processors, including background image recognition engines, which receive corresponding sequences of images from the plurality of cameras, mask the identified subjects to generate masked images, and process the masked images to identify and classify background changes represented in the images in the corresponding sequences of images; and third image processors, including foreground image recognition engines, which receive corresponding sequences of images from the plurality of cameras and process the images to identify and classify foreground changes represented in the images in the corresponding sequences of images.

The system of claim 145, wherein the foreground image recognition engines and the background image recognition engines include convolutional neural networks.

The system of claim 145, including logic to associate the identified background changes and the identified foreground changes with the identified subjects.

The system of claim 145, wherein the second image processors include: background image storage, to store background images for the corresponding sequences of images; and mask logic, to process the images in the sequences of images and replace foreground image data representing the identified subjects with background image data from the background images for the corresponding sequences of images, to provide the masked images.

The system of claim 148, wherein the mask logic combines sets of N masked images in the sequences of images to generate sequences of factorized images for each camera, and the second image processors identify and classify background changes by processing the sequences of factorized images.

The system of claim 145, wherein the second image processors include logic to generate change data structures for the corresponding sequences of images, the change data structures including coordinates in the masked images of the identified background changes, identifiers of the inventory items that are the subject of the identified background changes, and classifications of the identified background changes; and coordination logic to process the change data structures from sets of cameras with overlapping fields of view to locate the identified background changes in real space.

The system of claim 150, wherein the classifications of the identified background changes in the change data structures indicate whether the identified inventory item has been added or removed relative to the background image.

The system of claim 150, wherein the classifications of the identified background changes in the change data structures indicate whether the identified inventory item has been added or removed relative to the background image; and including logic to associate the background changes with identified subjects, and to make detections of takes of inventory items by the identified subjects and of puts of inventory items on inventory display structures by the identified subjects.

The system of claim 145, including logic to associate background changes and identified foreground changes with identified subjects, and to make detections of takes of inventory items by the identified subjects and of puts of inventory items on inventory display structures by the identified subjects.

The system of claim 145, wherein the first image processors identify locations of hands of the identified subjects; and including logic to compare the locations of the changes with the locations of the hands of the identified subjects to associate the background changes with identified subjects, and to make detections of takes of inventory items by the identified subjects and of puts of inventory items on inventory display structures by the identified subjects.

The system of claim 145, including: logic to associate background changes with identified subjects, and to make a first set of detections of takes of inventory items by the identified subjects and of puts of inventory items on inventory display structures by the identified subjects; logic to associate foreground changes with identified subjects, and to make a second set of detections of takes of inventory items by the identified subjects and of puts of inventory items on inventory display structures by the identified subjects; and selection logic to process the first and second sets of detections to generate log data structures including lists of inventory items for identified subjects.

The system of claim 145, wherein the sequences of images from the cameras in the plurality of cameras are synchronized.

A method for tracking puts and takes of inventory items by subjects in an area of real space, including: using a plurality of cameras, arranged on inventory display structures, to generate respective sequences of images of the inventory display structures in corresponding fields of view in the real space, the field of view of each camera overlapping with the field of view of at least one other camera in the plurality of cameras; and detecting puts and takes of inventory items by identifying gestures of subjects and inventory items associated with the gestures by processing foreground data in the sequences of images.

The method of claim 157, including detecting puts and takes of inventory items by identifying semantically significant changes in inventory items on the inventory display structures by processing background data in the sequences of images.

The method of claim 158, including using first image processors, including subject image recognition engines, to process the images to identify subjects represented in the images in the corresponding sequences of images; wherein detecting puts and takes of inventory items by identifying semantically significant changes in inventory items includes using second image processors, including background image recognition engines, to mask the identified subjects in the images in the sequences of images to generate masked images, and to process the masked images to identify and classify background changes represented in the images in the corresponding sequences of images; and wherein detecting puts and takes of inventory items by identifying gestures of subjects and inventory items associated with the gestures includes using third image processors, including foreground image recognition engines, which receive corresponding sequences of images from the plurality of cameras, to process the images to identify and classify foreground changes represented in the images in the corresponding sequences of images.

The method of claim 159, wherein the background image recognition engines and the foreground image recognition engines include convolutional neural networks.

The method of claim 159, including associating the identified background changes and foreground changes with the identified subjects.

The method of claim 159, wherein using the second image processors includes storing background images for the corresponding sequences of images; and processing the images in the sequences of images to replace foreground image data representing the identified subjects with background image data from the background images for the corresponding sequences of images, to provide the masked images.

The method of claim 162, wherein processing the images in the sequences of images includes combining sets of N masked images in the sequences of images to generate sequences of factorized images for each camera, and the second image processors identify and classify background changes by processing the sequences of factorized images.

The method of claim 162, wherein using the second image processors includes generating change data structures for the corresponding sequences of images, the change data structures including coordinates in the masked images of the identified background changes, identifiers of the inventory items that are the subject of the identified background changes, and classifications of the identified background changes; and processing the change data structures from sets of cameras with overlapping fields of view to locate the identified background changes in real space.

The method of claim 164, wherein the classifications of the identified background changes in the change data structures indicate whether the identified inventory item has been added or removed relative to the background image.

The method of claim 164, wherein the classifications of the identified background changes in the change data structures indicate whether the identified inventory item has been added or removed relative to the background image; and including associating the background changes with identified subjects, and making detections of takes of inventory items by the identified subjects and of puts of inventory items on inventory display structures by the identified subjects.

The method of claim 159, wherein using the first image processors includes identifying locations of hands of the identified subjects; and including comparing the locations of the changes with the locations of the hands of the identified subjects to associate the background changes with identified subjects, and making detections of takes of inventory items by the identified subjects and of puts of inventory items on inventory display structures by the identified subjects.

The method of claim 159, including associating background changes with identified subjects, and making a first set of detections of takes of inventory items by the identified subjects and of puts of inventory items on inventory display structures by the identified subjects; associating foreground changes with identified subjects, and making a second set of detections of takes of inventory items by the identified subjects and of puts of inventory items on inventory display structures by the identified subjects; and processing the first and second sets of detections to generate log data structures including lists of inventory items for identified subjects.

The method of claim 157 includes synchronizing the sequences of images from the cameras in the plurality of cameras.

A computer program product comprising: a computer readable memory, which includes a non-transitory data storage medium; and computer instructions stored in the memory, executable by a computer to track multi-joint subjects in an area of real space by a process including: using sequences of images of corresponding fields of view in the real space from a plurality of cameras, the field of view of each camera overlapping with the field of view of at least one other camera in the plurality of cameras; using first image processors, including subject image recognition engines, which process the images to identify subjects represented in the images in the corresponding sequences of images; using second image processors, including background image recognition engines, to mask the identified subjects in the images in the sequences of images to generate masked images, and to process the masked images to identify and classify background changes represented in the images in the corresponding sequences of images; and using third image processors, including foreground image recognition engines, which receive corresponding sequences of images from the plurality of cameras, to process the images to identify and classify foreground changes represented in the images in the corresponding sequences of images.

The computer program product of claim 170, wherein using the first image processors includes identifying locations of hands of the identified subjects; and including comparing the locations of the changes with the locations of the hands of the identified subjects to associate the background changes with identified subjects, and making detections of takes of inventory items by the identified subjects and of puts of inventory items on inventory display structures by the identified subjects.

The computer program product of claim 170, including associating background changes with identified subjects, and making a first set of detections of takes of inventory items by the identified subjects and of puts of inventory items on inventory display structures by the identified subjects; associating foreground changes with identified subjects, and making a second set of detections of takes of inventory items by the identified subjects and of puts of inventory items on inventory display structures by the identified subjects; and processing the first and second sets of detections to generate log data structures including lists of inventory items for identified subjects.

A method for training a neural network to detect puts and takes of inventory items by subjects in an area of real space, including: using a plurality of cameras to generate respective sequences of images of scripted actors, the scripted actors performing gestures with inventory items in corresponding fields of view in the real space; and using the sequences of images of the scripted actors to train neural networks to detect puts and takes of inventory items by identifying gestures of subjects and inventory items associated with the gestures represented in the sequences of images.

A method for tracking puts and takes of inventory items by subjects in an area of real space, including: using image recognition engines to identify subjects, and puts and takes of inventory items by the subjects, in the area of real space represented in sequences of images; and displaying a graphic showing a map of the subjects in the area of real space, the graphic including colored images representing individual subjects, and including assigning colors to the colored images based on the status of the individual subjects.

A method for tracking puts and takes of inventory items by subjects in an area of real space, including: using image recognition engines to identify subjects, and puts and takes of inventory items by the subjects, in the area of real space represented in sequences of images; assigning the inventory items to individual subjects; and displaying a graphic showing a map of the subjects in the area of real space, the graphic including colored images representing individual subjects, and including assigning colors to the colored images based on a level of confidence that the inventory items assigned to the respective subjects are correctly identified.
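Assigning a color from a confidence level can be sketched as a simple threshold mapping. The claim does not specify colors or thresholds; the three colors and the cutoffs 0.9 and 0.6 below are hypothetical values chosen only to illustrate the idea.

```python
def color_for_confidence(confidence: float) -> str:
    """Map a confidence level in [0, 1] to a display color.

    The colors and thresholds are illustrative assumptions.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if confidence >= 0.9:
        return "green"    # assigned items are very likely correct
    if confidence >= 0.6:
        return "yellow"   # assigned items may need review
    return "red"          # assigned items are likely wrong


# Color each subject marker on the map by its confidence level.
subject_confidence = {"subject-1": 0.95, "subject-2": 0.7, "subject-3": 0.2}
marker_colors = {
    subject: color_for_confidence(conf)
    for subject, conf in subject_confidence.items()
}
```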

A method for tracking puts and takes of inventory items by subjects in an area of real space in a store having a store inventory, including: using image recognition engines to identify subjects, and puts and takes of inventory items by the subjects, in the area of real space represented in sequences of images; assigning the inventory items to individual subjects; and using the detected puts and takes to generate an inventory of the store.

A method for detecting a directional impression of a subject in a region of real space, comprising: using an image processor to process a sequence of images to identify the subject in the region, and to determine the direction of gaze of the identified subject.