EP3420470A1 - Method for describing multimedia documents by inter-modality translation, and associated system and computer program - Google Patents
Info
- Publication number
- EP3420470A1 (application EP17705921.9A)
- Authority
- EP
- European Patent Office
- Prior art keywords
- modality
- point
- description
- points
- document
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/40—Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
- G06F16/41—Indexing; Data structures therefor; Storage structures
Definitions
- the field of the invention is the description of multimedia documents for use in content-based information retrieval or in the supervised classification of multimedia content.
- the invention is more particularly concerned with bringing content described by one modality (for example purely visual content) closer to content described by another modality (for example purely textual content).
- content-based information retrieval and the supervised classification of documents require a step of describing the content of the documents.
- a multimedia document consists of at least two elementary media, for example chosen from images, sounds, video signals, and texts. Because the modalities that define a multimedia document are heterogeneous, describing it is a delicate task.
- this description step separately transforms the textual content (words) and the visual content (pixels) into a feature vector of generally fixed dimension.
- these vectors are indexed in a reference database.
- these vectors are used to train a model with a learning algorithm.
- a first problem is that the textual content is not described by the same type of vector as the visual content.
- these vectors are usually not the same size. Even if they happen to have the same dimension, they do not span the same subspace. In any case, they cannot be compared directly: they cannot be indexed in the same way for content search, and they cannot be used to train the same model for supervised classification.
- CCA: Canonical Correlation Analysis
- KCCA: Kernel Canonical Correlation Analysis
- a document with purely textual content T1 is projected at a point PT1.
- a document with purely visual content V2 is projected at a point PV2.
- the points PT1 and PV2 are in the same space and can therefore be compared directly.
- the projection of the description of the visual content V2 corresponds, for example, to the nearest neighbor of the projection of the description of the textual content T1 in the common representation space.
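As an illustration of why a common space helps, the following minimal Python sketch projects a textual and a visual feature vector of different dimensions into a shared 2-D space and compares them there. The projection matrices `W_TEXT` and `W_VISUAL`, the toy dimensions and the helper name `project` are hypothetical; in practice the projections would be learned, for example by CCA or KCCA.

```python
import math

def project(matrix, vector):
    """Linear projection: one output coordinate per matrix row."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

# Hypothetical projection matrices (assumed already learned, e.g. by CCA).
W_TEXT = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 0.0]]           # 3-dim text features -> 2-dim common space
W_VISUAL = [[0.5, 0.5, 0.0, 0.0],
            [0.0, 0.0, 0.5, 0.5]]    # 4-dim visual features -> 2-dim common space

t1 = [1.0, 2.0, 9.0]                 # textual description T1 (toy values)
visual_docs = {
    "V1": [8.0, 8.0, 0.0, 0.0],
    "V2": [1.0, 1.0, 2.0, 2.0],
}

pt1 = project(W_TEXT, t1)
projected = {name: project(W_VISUAL, v) for name, v in visual_docs.items()}

# Both kinds of points now live in the same space and can be compared directly.
nearest = min(projected, key=lambda name: math.dist(pt1, projected[name]))
print(nearest)
```

With these toy values the projection of V2 is the nearest neighbor of PT1, which mirrors the situation described above.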
- the method comprises the following steps: for each multimedia document of the multimodal database, projecting the description of the document according to the first modality into the common representation space so as to obtain a first point, and projecting the description of the document according to the second modality into the common representation space so as to obtain a second point associated with the first point;
- determining a description of the query document according to the second modality comprises calculating a weighted average of the k second points associated with the k identified first points, so as to provide a target point;
- the common representation space is divided into a plurality of regions, each region being represented by a quantization code word, and the query point and the k second points associated with the k identified first points are coded according to a dictionary formed by the quantization code words;
- the coding of a point according to the dictionary corresponds to the component-wise differences between the point and the codewords closest to said point in the common representation space;
- the determination of a description of the query document according to the second modality comprises the calculation of a weighted average of the codings of the k second points associated with the k identified first points; the weight associated with a second point in the calculation of the weighted average is a function of the distance, in the common representation space, between the query point and the first point associated with the second point.
- the invention is also directed to a computer program product comprising program code instructions for performing the steps of the method when said program is executed on a computer. It further extends to a system configured to enable the steps of this method to be performed.
- FIG. 2 is a diagram illustrating the various steps of the method according to the invention.
- FIG. 3 is a diagram illustrating a quantization of the common representation space that can be implemented in a possible embodiment of the invention.
- the invention relates to a method of generating, in a computing device, a multimodal description of a document, called the query document, from a description of the document according to a first modality, for example a visual modality VM.
- the query document may have no description according to a second modality (the document is, for example, mono-media), or an existing description according to a second modality of the document (here multimedia) may be ignored in order to determine one by the method according to the invention.
- the term "description according to a modality" means a feature vector representative of said modality in the document.
- a feature vector $x^T$ is extracted from its textual content and another feature vector $x^I$ is extracted from its visual content.
- the method exploits a representation space Ec common to both the descriptions according to the first modality and the descriptions according to the second modality.
- each document, here treated as a pair of feature vectors $(x^I, x^T)$, is represented by two points: $p^I$, which is the projection of $x^I$, and $p^T$, which is the projection of $x^T$.
- the textual feature vectors $x^T$ are of dimension 300.
- the visual feature vectors $x^I$ are of dimension 4096.
- the method also uses a multimodal base Bm consisting of a set of multimedia documents M1, M2, M3, each having a description V1, V2, V3 according to the first modality and a description T1, T2, T3 according to the second modality.
- this base provides a set of bimodal pivot points able to reflect the imperfections of the common representation space.
- this bimodal base may be the learning base, although this is not required.
- the bimodal descriptions of the documents of the multimodal base Bm are projected in the common representation space Ec.
- the method thus comprises a step consisting, for each multimedia document M1, M2, M3 of the multimodal base Bm, of projecting the description V1, V2, V3 of the document according to the first modality into the common representation space so as to obtain a first point PV1, PV2, PV3, and of projecting the description T1, T2, T3 of the document according to the second modality into the common representation space so as to obtain a second point PT1, PT2, PT3 associated with the first point.
- the method according to the invention also comprises a step of projecting the description VM (also denoted $r^I$) of the query document according to the first modality into the common representation space Ec, so as to obtain a query point PVM.
- the objective is then to determine, from the query point PVM, one or more target points PTc of the common representation space that complete the description of the query document with the other modality (denoted $r^T$).
- a naive approach would be to take as target points the k nearest neighbors of PVM among the points resulting from the projection of a description according to the second modality (this set of points is denoted $NN^T(r^I)$).
- this approach would lead, starting from PVM, to identifying the point PT-A referring to a textual content T-A stored in a reference database Bref, which is a priori different from the learning base used to determine the common representation space.
- the invention proposes another approach, which searches for the nearest neighbors of the query point $r^I$ in the common representation space not among the second points but among the first points $E^I$ (points of the same modality).
- the integer k is typically greater than 10. It is preferably greater than 20.
- the metric used to identify the nearest neighbors is, for example, the Euclidean distance.
- this step identifies the two nearest neighbors of the query point PVM in the same modality, namely PV1 and PV2.
- the method comprises a step of identifying, among the second points, the k second points associated with the k identified first points.
- these are PT1 and PT2, the complementary points (i.e. the points corresponding to the other modality) of the nearest neighbors PV1 and PV2 of the query point PVM in the same modality.
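The neighbor search described above can be sketched as follows. This is a minimal illustration under assumptions: the multimodal base is held as a plain list of (first point, second point) pairs, the coordinates are toy values, and the function name `complementary_points` is hypothetical. The search runs among the first points (same modality as the query) and returns the associated second points.

```python
import math

def complementary_points(query_point, base, k):
    """Return (distance, second_point) for the k pairs of `base` whose
    first point is nearest to `query_point` (Euclidean distance)."""
    ranked = sorted(base, key=lambda pair: math.dist(query_point, pair[0]))
    return [(math.dist(query_point, first), second) for first, second in ranked[:k]]

# Toy multimodal base: each entry pairs a first point with its second point.
base = [
    ([0.0, 0.0], [0.0, 1.0]),   # (PV1, PT1)
    ([1.0, 0.0], [1.0, 1.0]),   # (PV2, PT2)
    ([5.0, 5.0], [6.0, 6.0]),   # (PV3, PT3)
]
pvm = [0.25, 0.0]               # query point PVM (toy value)

neighbours = complementary_points(pvm, base, k=2)
for distance, second_point in neighbours:
    print(distance, second_point)
```

Keeping the distance alongside each second point is a design choice of this sketch: it makes the distance available for a later weighting step without recomputing it.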
- the target point is $\sum_{q_j \in M_c(r^I)} \omega_j \, q_j$, where $M_c(r^I)$ is the set of the k second points associated with the k nearest first points of $r^I$.
- the weight $\omega_j$ associated with a second point $q_j \in M_c(r^I)$ in the calculation of the weighted average is a function of the distance, in the common representation space, between the query point $r^I$ and the first point associated with $q_j$ (one of the nearest neighbors of $r^I$ in the same modality).
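A weighted average of this kind might be computed as in the sketch below. The text only states that the weight is a function of the distance; the inverse-distance weighting used here, the small constant `eps` and the function name `weighted_target` are assumptions made for illustration.

```python
def weighted_target(neighbours, eps=1e-9):
    """Weighted average of second points.

    `neighbours` is a list of (distance, second_point) pairs, where the
    distance is measured between the query point and the associated first
    point. Inverse-distance weights are an assumption of this sketch.
    """
    weights = [1.0 / (distance + eps) for distance, _ in neighbours]
    total = sum(weights)
    dimension = len(neighbours[0][1])
    return [
        sum(w * point[i] for w, (_, point) in zip(weights, neighbours)) / total
        for i in range(dimension)
    ]

# Toy example: the closer neighbour (distance 1.0) pulls the target towards it.
target = weighted_target([(1.0, [0.0, 1.0]), (3.0, [4.0, 1.0])])
print(target)
```

Dividing by the sum of the weights keeps the target point inside the region spanned by the second points, whatever the scale of the distances.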
- the method may comprise identifying one or more documents having a description, for example according to the second modality, whose projection in the common representation space is closest to the target point. These documents are typically stored in the reference database Bref. In the example of FIG. 2, this step identifies the textual content T-B, whose projection PT-B is close to the target point PTc.
- the reference base may be a bimodal text-image base or a textual or visual mono-modal base. Taking the example of a text query and a mono-modal text reference base, the invention makes it possible to take a multimedia aspect into account. For example, the query "hawai" and the text "florida" can be matched because images (of the multimodal base Bm) tagged with these words (or with words close to them) are similar.
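The lookup of reference documents near the target point can be sketched as a plain nearest-neighbor search. The dictionary `reference`, its toy coordinates and the function name `nearest_documents` are hypothetical; a real reference base would normally rely on an index rather than a full sort.

```python
import math

def nearest_documents(target, reference, n=1):
    """Return the names of the n reference documents whose projected
    description is closest to `target` (Euclidean distance)."""
    ranked = sorted(reference, key=lambda name: math.dist(target, reference[name]))
    return ranked[:n]

# Toy reference base: document name -> projection in the common space.
reference = {
    "T-A": [3.0, 3.0],
    "T-B": [1.1, 0.9],
}
ptc = [1.0, 1.0]   # target point PTc (toy value)

print(nearest_documents(ptc, reference))
```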
- the common representation space Ec is divided into a plurality of regions, each region being represented by a quantization code word C1-C8.
- the various points (in particular the query point and the k second points associated with the k identified first points) are coded according to a dictionary formed by the quantization code words.
- this division of the common representation space can be performed by means of a k-means partitioning algorithm that exploits all the projections of the learning base, whether they come from descriptions according to the first modality or from descriptions according to the second modality. The partitioning provides three types of codewords (which are the centers of the partitions).
- the encoding can be performed using techniques known to those skilled in the art, such as those reviewed in the article Yongzhen Huang, Zifeng Wu, Liang Wang, Tieniu Tan, "Feature Coding in Image Classification: A Comprehensive Study," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 3, pp. 493-506, March 2014.
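A k-means partitioning of projected points can be sketched in a few lines of plain Python. This is a minimal version of Lloyd's algorithm, not the patented implementation: the initialisation from the first k points, the fixed iteration count and the toy coordinates are assumptions, and the points stand for projections coming from both modalities mixed together.

```python
import math

def kmeans(points, k, iterations=20):
    """Minimal Lloyd's algorithm; returns the k centers (the codewords)."""
    centers = [list(p) for p in points[:k]]   # naive initialisation (assumption)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            index = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[index].append(p)
        for c, members in enumerate(clusters):
            if members:   # keep the previous center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers

# Toy projections from both modalities, mixed in one list.
points = [
    [0.0, 0.0], [10.0, 10.0], [0.0, 1.0],
    [10.0, 11.0], [1.0, 0.0], [11.0, 10.0],
]
centers = kmeans(points, k=2)
print(centers)
```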
- the coding of a point according to the dictionary can in particular correspond to the differences by component (gradient) of the point with the codewords closest to said point in the common representation space.
- the point PT has the codewords C2, C7 and C8 as closest codewords.
- the point PV has the codewords C6, C5 and C1 as closest codewords.
- the determination of a description of the query document according to the second modality is carried out from the dictionary coding of each of the k second points associated with the k identified first points.
- a weighted average of these encodings can be computed to provide a coded description of a target point PTc.
- the invention is not limited to the method described above, but also extends to a computer program product comprising program code instructions for performing the steps of the method as previously described when said program is executed on a computer.
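The coding and averaging steps might look like the following sketch. The layout of the code (one block of component-wise differences per codeword, left at zero for codewords that are not among the nearest ones) is an assumption, as are the function names `encode` and `average_codes`, the toy codewords and the weights.

```python
import math

def encode(point, codewords, n_nearest=3):
    """Code a point as its component-wise differences with its n nearest
    codewords; blocks of the other codewords stay at zero (assumed layout)."""
    order = sorted(range(len(codewords)),
                   key=lambda i: math.dist(point, codewords[i]))
    nearest = set(order[:n_nearest])
    code = []
    for i, codeword in enumerate(codewords):
        if i in nearest:
            code.extend(x - c for x, c in zip(point, codeword))
        else:
            code.extend(0.0 for _ in point)
    return code

def average_codes(codes, weights):
    """Weighted average of equally sized codes."""
    total = sum(weights)
    return [sum(w * code[i] for w, code in zip(weights, codes)) / total
            for i in range(len(codes[0]))]

# Toy dictionary of four codewords in a 2-D common space.
codewords = [[0.0, 0.0], [4.0, 0.0], [0.0, 4.0], [10.0, 10.0]]
code_a = encode([1.0, 0.5], codewords, n_nearest=2)
code_b = encode([3.0, 0.5], codewords, n_nearest=2)

# Hypothetical weights, e.g. derived from distances to the query point.
target_code = average_codes([code_a, code_b], weights=[3.0, 1.0])
print(target_code)
```

The resulting code has one block per codeword, so two points coded this way can be averaged component by component even when their nearest codewords differ.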
- the invention makes it possible to improve the performance in certain cases compared to existing techniques and makes it possible to solve certain recognition problems.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Software Systems (AREA)
- Multimedia (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| FR1651591A FR3048295A1 (fr) | 2016-02-26 | 2016-02-26 | Method for describing multimedia documents by inter-modality translation, and associated system and computer program |
| PCT/EP2017/054148 WO2017144577A1 (fr) | 2016-02-26 | 2017-02-23 | Method for describing multimedia documents by inter-modality translation, and associated system and computer program |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| EP3420470A1 (fr) | 2019-01-02 |
Family
ID=56101592
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| EP17705921.9A Ceased EP3420470A1 (fr) | 2016-02-26 | 2017-02-23 | Method for describing multimedia documents by inter-modality translation, and associated system and computer program |
Country Status (3)
| Country | Link |
|---|---|
| EP (1) | EP3420470A1 (fr) |
| FR (1) | FR3048295A1 (fr) |
| WO (1) | WO2017144577A1 (fr) |
2016
- 2016-02-26: FR application FR1651591A, patent FR3048295A1 (fr), active, Pending
2017
- 2017-02-23: WO application PCT/EP2017/054148, patent WO2017144577A1 (fr), not active, Ceased
- 2017-02-23: EP application EP17705921.9A, patent EP3420470A1 (fr), not active, Ceased
Also Published As
| Publication number | Publication date |
|---|---|
| FR3048295A1 (fr) | 2017-09-01 |
| WO2017144577A1 (fr) | 2017-08-31 |
Similar Documents
| Publication | Title |
|---|---|
| US11361017B1 (en) | Method to differentiate and classify fingerprints using fingerprint neighborhood analysis |
| CN106126617B (zh) | Video detection method and server |
| US8266185B2 (en) | System and methods thereof for generation of searchable structures respective of multimedia data content |
| US6977679B2 (en) | Camera meta-data for content categorization |
| US7539657B1 (en) | Building parallel hybrid spill trees to facilitate parallel nearest-neighbor matching operations |
| US20140324840A1 (en) | System and method for linking multimedia data elements to web pages |
| US20120099793A1 (en) | Video summarization using sparse basis function combination |
| WO2012141655A1 (fr) | In-video product annotation with web information mining |
| FR2996939A1 (fr) | Method for classifying a multimodal object |
| EP1728195A1 (fr) | Method and system for semantically segmenting scenes of a video sequence |
| FR2968426A1 (fr) | Large-scale asymmetric comparison computation for binary embeddings |
| KR101634395B1 (ko) | Method for comparing sequences, apparatus therefor, and computer program product |
| WO2016102153A1 (fr) | Semantic representation of the content of an image |
| EP3356955A1 (fr) | Method and system for searching for similar images that is nearly independent of the scale of the image collection |
| EP2962301A2 (fr) | Generation of a signature of a musical audio signal |
| Zhang et al. | Large-scale video retrieval via deep local convolutional features |
| EP2839410A1 (fr) | Method for recognizing a visual context of an image and corresponding device |
| Ciaparrone et al. | A comparison of deep learning models for end-to-end face-based video retrieval in unconstrained videos |
| WO2005093752A1 (fr) | Method and system for detecting audio and video scene changes |
| EP3420470A1 (fr) | Method for describing multimedia documents by inter-modality translation, and associated system and computer program |
| WO1999040539A1 (fr) | Method for spatial segmentation of an image into visual objects, and application |
| FR2830958A1 (fr) | Method for indexing, storing and comparing multimedia documents |
| US12417245B2 (en) | Scalable video fingerprinting for content authenticity |
| Sun et al. | Hash length prediction for video hashing |
| Bhaumik et al. | Keyframe Selection for Video Indexing Using Approximate Minimal Spanning Tree |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: UNKNOWN |
| | STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE |
| | PUAI | Public reference made under article 153(3) epc to a published international application that has entered the european phase | Free format text: ORIGINAL CODE: 0009012 |
| | STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE |
| | 17P | Request for examination filed | Effective date: 20180827 |
| | AK | Designated contracting states | Kind code of ref document: A1 Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR |
| | AX | Request for extension of the european patent | Extension state: BA ME |
| | DAV | Request for validation of the european patent (deleted) | |
| | DAX | Request for extension of the european patent (deleted) | |
| | STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: EXAMINATION IS IN PROGRESS |
| | 17Q | First examination report despatched | Effective date: 20190709 |
| | REG | Reference to a national code | Ref country code: DE Ref legal event code: R003 |
| | STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: THE APPLICATION HAS BEEN REFUSED |
| | 18R | Application refused | Effective date: 20200529 |