CN106021402A

CN106021402A - Multi-modal multi-class Boosting frame construction method and device for cross-modal retrieval

Info

Publication number: CN106021402A
Application number: CN201610316164.4A
Authority: CN
Inventors: 王世勋; 潘鹏; 孙林; 张仕光; 李源
Original assignee: Henan Normal University
Current assignee: Henan Normal University
Priority date: 2016-05-13
Filing date: 2016-05-13
Publication date: 2016-10-12

Abstract

The invention relates to a method and device for constructing a multi-modal multi-class Boosting framework for cross-modal retrieval. The method includes: constructing an objective risk function that comprises the intra-modal loss of each modality and the inter-modal losses between modalities; following a gradient descent strategy, updating the predictor of each modality in the risk function in turn while the predictors of the other modalities are held fixed, where one pass in which the predictors of all modalities have been updated is called a loop iteration; after T loop iterations, learning the optimal predictor of each modality that minimizes the objective function; and converting the quasi-margins produced by the optimal predictor of each modality into a common semantic space with the Sigmoid function, so as to realize cross-modal retrieval. The method takes the semantic correlation between modalities into account, can to a certain extent enhance the semantic information of lower-quality modalities, and performs better in cross-modal retrieval tasks.

Description

Method and device for constructing multi-modal and multi-class Boosting framework for cross-modal retrieval

Technical field

The invention belongs to the field of information retrieval, and in particular relates to a method and device for constructing a multi-modal multi-class Boosting framework for cross-modal retrieval.

Background

The core idea of Boosting classification is to combine multiple weak classifiers into a strong classifier. The approach has been widely studied in application areas such as computer vision and pattern recognition, with good results. However, traditional Boosting methods learn classification rules from a dataset of a single modality only and cannot handle multi-modal datasets directly. In general, traditional Boosting can be applied to cross-modal retrieval by mapping each modality's dataset to the semantic space separately. This scheme, however, ignores the crucial inter-modal information, which degrades retrieval performance to some extent.

At present, how to represent the low-level features of multimedia data is an important part of information retrieval. Researchers have produced many results in this area, such as SIFT features for images, LDA features for text, and MFCC features for audio. However, these low-level content features differ in both dimensionality and attributes, which makes multimedia data of different modalities heterogeneous and not directly comparable. For example, given a piece of text describing the historical background of the "Yellow Crane Tower", a user may want to retrieve an image depicting the "Yellow Crane Tower". Although both the text and the image express the semantics of the "Yellow Crane Tower", traditional single-modal retrieval methods cannot compute their correlation directly on the low-level features. If the text and image data about the "Yellow Crane Tower" are mapped to a common semantic space, the user's cross-modal retrieval request can easily be carried out.

Both the vocabulary and the mapping mechanism play an important role in learning the semantic space. The former fixes the dimension of the semantic space, while the latter projects low-level content features into high-level semantic features. Assume that $V=\{v_1,\dots,v_K\}$ denotes a vocabulary composed of K different semantic concepts; these are the semantic classes of interest, such as specific topics or object attributes. Using this vocabulary, a mapping mechanism can be learned from the dataset. For any unimodal data item x, the mapping gives its score for each semantic concept $v_k$. Each dimension of the semantic space corresponds to one concept in the vocabulary, so the score vector $\pi$ can be regarded as the semantic feature representation of the unimodal data in the semantic space.

Depending on the values taken by the semantic label variable s, unimodal data x can have two different types of semantic feature representation. If $s \in \{1,\dots,K\}$, the semantic feature representation of x is the vector of posterior class probabilities, i.e. the elements of the score vector $\pi$ sum to 1. In this case the semantic concepts in the vocabulary are mutually exclusive, so x can belong to only one semantic class. If $s \in \{0,1\}^K$, the elements of $\pi$ do not sum to 1 in general. In this case the semantic concepts are not mutually exclusive, so x can belong to several semantic categories at once. These two representations reflect two purposes of a unimodal dataset: 1) using specific semantic categories, the dataset can be partitioned into disjoint subsets; 2) many unimodal data items can share the semantic attributes of objects. The two purposes can be combined through a simple two-layer framework. First, the first-layer classifiers represent the unimodal dataset as score vectors over semantic attributes; then the second-layer classifier maps these vectors into the semantic space formed by the specific semantic classes.

In the semantic space, the semantic feature representation of unimodal data has several advantages. First, the semantic descriptor is a score vector over concept classes, which gives multimedia data a higher level of abstraction. Second, compared with content features, semantic features have lower dimensionality and higher discriminative power, so many computer vision tasks can be solved with low-dimensional classifiers. Third, semantic feature representations can capture contextual relationships between semantic concepts. For example, most images in the "sky" class contain the concept "white cloud", so the presence of "white cloud" means the image very likely belongs to the "sky" class. If the semantic feature elements for "sky" and "white cloud" both score highly, the vision system can capture the contextual relationship between them. Fourth, text classifiers usually outperform image classifiers, so the semantic features of text tend to be more accurate. Through cross-modal regularization, a regularization matrix can be used to remove noise from the semantic features of images. Finally, because semantic concepts are abstract, the semantic space provides a consistent, isomorphic feature representation for data of different modalities, which facilitates cross-modal retrieval.

The semantic space is a probability simplex. In general, for the k-th semantic concept in the vocabulary, two types of mapping mechanism can be used to compute the posterior probability of data x:

$\pi_k = P(s = k \mid x) \qquad (1)$

Given a labeled training dataset, one mapping mechanism learns the class-conditional distribution $P(x \mid s)$ of each semantic concept and then applies Bayes' rule to compute the posterior probability in equation (1). The other mapping mechanism learns a multi-class classifier so that the posterior probability in equation (1) is estimated directly; this is the direct multi-class Boosting approach.
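As a minimal sketch of the first mapping mechanism, the snippet below applies Bayes' rule to turn class-conditional likelihoods $P(x \mid s=k)$ into the posterior of equation (1). The function name, the class priors, and the numeric values are illustrative assumptions and do not come from the source.

```python
def posterior_from_likelihoods(likelihoods, priors):
    """Bayes' rule: pi_k = P(x|s=k) P(s=k) / sum_j P(x|s=j) P(s=j)."""
    joint = [lik * prior for lik, prior in zip(likelihoods, priors)]
    evidence = sum(joint)
    return [j / evidence for j in joint]


# Hypothetical likelihoods and class priors for K = 3 semantic concepts.
pi = posterior_from_likelihoods([0.2, 0.1, 0.7], [0.5, 0.25, 0.25])
print(pi)  # a point on the probability simplex: non-negative, sums to 1
```

The direct multi-class Boosting route skips the likelihood model and estimates the same vector from the learned predictor.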

In general, a single-modal multi-class Boosting method can map the data of each modality to the semantic space separately and thereby perform cross-modal matching. However, this scheme ignores the correlation between modalities and may produce unsatisfactory results in the semantic space. Figure 1 gives an example of projecting multi-modal data with single-modal and multi-modal multi-class Boosting methods. In the figure, "Semantic Concept 1" denotes the "Sport" semantic class, dashed and solid arrows denote the single-modal and multi-modal Boosting mappings respectively, and the symbols "+" and "×" denote the feature representations of images and text in the semantic space. If the low-level image features are of poor quality, the image semantic features produced by the single-modal Boosting mapping may drift away from the semantic concept "Sport". As shown in the upper right corner of Figure 1, although the semantic features of the text are close to the correct semantic concept, they cannot help the image improve its semantic features, because text and image are mapped independently.

To avoid this problem, a multi-modal multi-class Boosting method is needed that combines intra-modal semantic information with inter-modal semantic correlation and analyzes the multi-modal dataset jointly, achieving the effect shown in the lower right corner of Figure 1: by exploiting the correlation of the modalities in the semantic space, the semantic features of a higher-quality modality improve those of the other modalities, so that the semantic distance between modalities shrinks.

Summary of the invention

The invention provides a method and device for constructing a multi-modal multi-class Boosting framework for cross-modal retrieval, aiming to solve the problem of low retrieval performance when traditional Boosting methods are applied to cross-modal retrieval.

To solve the above technical problem, the method of the invention for constructing a multi-modal multi-class Boosting framework for cross-modal retrieval comprises the following steps:

1) Construct the objective risk function $R[f_1,\dots,f_M]$, which comprises the intra-modal loss of each modality and the inter-modal losses between modalities, where $f_1$ is the predictor of the first modality, $f_M$ is the predictor of the M-th modality, and $M \ge 2$;

2) Following a gradient descent strategy, update the predictor of each modality in the objective risk function in turn while holding the predictors of the other M-1 modalities fixed. One pass in which the predictors of all modalities have been updated is called a loop iteration. After T loop iterations, where $T \ge 1$, the optimal predictor of each modality that minimizes the objective risk function is learned;

3) Convert the quasi-margins produced by the optimal predictor of each modality into a common semantic space to realize cross-modal retrieval.

One loop iteration in step 2) proceeds as follows:

A) Compute the weights of each modality's documents according to the gradient descent strategy;

B) Using the weights of each modality's documents, compute the first-order functional partial derivative of the objective risk function along the direction of a multi-class learner, in the neighborhood of the predictor being updated, and then find in the functional space the multi-class learner that reduces the risk the most, i.e. the optimal direction;

C) With the multi-class learner found in step B), compute the optimal step size along the optimal direction and update the predictor accordingly.
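The update schedule of step 2) can be illustrated with a toy stand-in. In the sketch below each "predictor" is a single number rather than a function, the intra-modal loss is a squared distance to a hypothetical target `a[m]`, and a plain gradient step with a fixed rate `lr` replaces steps A) to C). All of these are simplifying assumptions; only the schedule (update one modality, hold the others fixed, repeat for T loop iterations) mirrors the text.

```python
def risk(f, a):
    """Toy risk: intra-modal squared errors plus pairwise inter-modal gaps."""
    intra = sum((fm - am) ** 2 for fm, am in zip(f, a))
    inter = sum((f[m] - f[j]) ** 2
                for m in range(len(f)) for j in range(m + 1, len(f)))
    return intra + inter


def cyclic_descent(f, a, T, lr=0.1):
    f = list(f)
    history = [risk(f, a)]
    for _ in range(T):                      # T loop iterations
        for m in range(len(f)):             # update modality m, others fixed
            grad = 2 * (f[m] - a[m]) + sum(
                2 * (f[m] - f[j]) for j in range(len(f)) if j != m)
            f[m] -= lr * grad
        history.append(risk(f, a))          # risk after one full pass
    return f, history


f_final, history = cyclic_descent([0.0, 0.0], [1.0, 3.0], T=50)
print(f_final, history[0], history[-1])
```

In this toy problem the inter-modal term pulls the two values toward each other, so the solution sits between the two targets rather than at them.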

In step A), the weights of each modality are computed based on the multi-class exponential loss of that modality, defined as $L[s,f(x)]=\sum_{k=1}^{K}\exp(\langle f(x),c^k-c^s\rangle)$, where f(x) is the predictor of the modality, K is the number of semantic classes in the semantic vocabulary, $\langle f(x),c^k-c^s\rangle$ is the difference between the predictor's quasi-margins for the k-th and s-th semantic classes, and $c^k$ and $c^s$ are the codebook vectors corresponding to the k-th and s-th semantic classes.

Alternatively, in step A), the weights of each modality are computed based on the multi-class logistic loss of that modality, defined as [formula not reproduced in this text], where f(x) is the predictor of the modality, K is the number of semantic classes in the semantic vocabulary, $\langle f(x),c^k-c^s\rangle$ is the difference between the predictor's quasi-margins for the k-th and s-th semantic classes, and $c^k$ and $c^s$ are the codebook vectors corresponding to the k-th and s-th semantic classes.

The risk function $R[f_1,\dots,f_M]$ is expressed as:

$\begin{aligned} R[f_1,\dots,f_M] &= \sum_{m=1}^{M} R_m[f_m(z^m)] + \sum_{m=1}^{M}\sum_{j>m}^{M} R_{mj}[f_m(z^m), f_j(z^j)] \\ &= \sum_{m=1}^{M}\sum_{i=1}^{N} L_m[s_i, f_m(z_i^m)] + \sum_{m=1}^{M}\sum_{j>m}^{M}\sum_{i=1}^{N} \left\| C^T [f_m(z_i^m) - f_j(z_i^j)] \right\|_2^2 \end{aligned}$

where $z_i^m$ denotes the i-th data object of the m-th modality, $L_m[\cdot]$ and $f_m(\cdot)$ denote the multi-class loss function and the predictor of the m-th modality respectively, and $\| C^T [f_m(z_i^m) - f_j(z_i^j)] \|_2^2$ is the inter-modal loss between the m-th and j-th modalities for the i-th data object.
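To make the two groups of terms concrete, the following sketch evaluates the risk for given predictor outputs. It assumes K = 3 with a hand-written codebook (unit vectors at the vertices of an equilateral triangle), 0-based labels, and the multi-class exponential loss as $L_m$; the function names and data layout are hypothetical.

```python
import math

# Assumed codebook for K = 3: columns c^1, c^2, c^3 of C.
C = [(1.0, 0.0), (-0.5, math.sqrt(3) / 2), (-0.5, -math.sqrt(3) / 2)]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def exp_loss(s, f):
    """Intra-modal term: sum over k of exp(f . (c^k - c^s))."""
    return sum(math.exp(dot(f, C[k]) - dot(f, C[s])) for k in range(len(C)))


def inter_loss(f, u):
    """Inter-modal term ||C^T (f - u)||_2^2 for one data object."""
    diff = [a - b for a, b in zip(f, u)]
    return sum(dot(c, diff) ** 2 for c in C)


def empirical_risk(outputs, labels):
    """outputs[m][i] plays the role of f_m(z_i^m); labels[i] is s_i."""
    M, N = len(outputs), len(labels)
    intra = sum(exp_loss(labels[i], outputs[m][i])
                for m in range(M) for i in range(N))
    inter = sum(inter_loss(outputs[m][i], outputs[j][i])
                for m in range(M) for j in range(m + 1, M) for i in range(N))
    return intra + inter


zeros = [[(0.0, 0.0), (0.0, 0.0)] for _ in range(2)]   # M = 2, N = 2
print(empirical_risk(zeros, [0, 1]))                   # 12.0
```

With all outputs at zero every exponential equals 1, so each of the four intra-modal losses equals K = 3 and the inter-modal terms vanish.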

The device of the invention for constructing a multi-modal multi-class Boosting framework for cross-modal retrieval includes an objective function construction module, an optimal predictor learning module, and a semantic space conversion module.

The objective function construction module constructs the objective risk function $R[f_1,\dots,f_M]$, which comprises the intra-modal loss of each modality and the inter-modal losses between modalities, where $f_1$ is the predictor of the first modality, $f_M$ is the predictor of the M-th modality, and $M \ge 2$.

The optimal predictor learning module, following a gradient descent strategy, updates the predictor of each modality in the objective risk function in turn while holding the predictors of the other M-1 modalities fixed. One pass in which the predictors of all modalities have been updated is called a loop iteration. After T loop iterations, where $T \ge 1$, the optimal predictor of each modality that minimizes the objective risk function is learned.

The semantic space conversion module converts the quasi-margins produced by the optimal predictor of each modality into a common semantic space to realize cross-modal retrieval.

One loop iteration proceeds as in steps A) to C) of the method above.

The weights of each modality are computed based on the multi-class exponential loss of that modality, defined as $L[s,f(x)]=\sum_{k=1}^{K}\exp(\langle f(x),c^k-c^s\rangle)$, where f(x) is the predictor of the modality, K is the number of semantic classes in the semantic vocabulary, $\langle f(x),c^k-c^s\rangle$ is the difference between the predictor's quasi-margins for the k-th and s-th semantic classes, and $c^k$ and $c^s$ are the codebook vectors corresponding to the k-th and s-th semantic classes.

Alternatively, the weights of each modality are computed based on the multi-class logistic loss of that modality, defined as [formula not reproduced in this text], where f(x) is the predictor of the modality, K is the number of semantic classes in the semantic vocabulary, $\langle f(x),c^k-c^s\rangle$ is the difference between the predictor's quasi-margins for the k-th and s-th semantic classes, and $c^k$ and $c^s$ are the codebook vectors corresponding to the k-th and s-th semantic classes.

The beneficial effects of the invention are as follows. The invention constructs an empirical objective risk function that combines intra-modal and inter-modal losses. By minimizing this objective, the Multimodal Multiclass Boosting (MMB) framework mines both the semantic information within each modality and the semantic correlation between modalities. By using a gradient descent strategy to update the predictor of each modality in turn, the MMB framework can readily solve the optimization problem in the multi-dimensional functional space. Based on the Sigmoid function, the quasi-margins produced by the optimal predictors can be converted into posterior probabilities of the semantic concept classes, so that cross-modal retrieval can be performed in the semantic space. On the one hand, intra-modal semantic information reflects the semantic expressiveness of each modality, while the inter-modal semantic information obtained by minimizing the inter-modal loss focuses on the correlation between modalities. Both types of semantic information play an important role in cross-modal retrieval and are complementary, so combining them improves retrieval performance. On the other hand, by minimizing the intra-modal loss, modalities with high-quality low-level features obtain good intra-modal semantic information; at the same time, the inter-modal semantic correlation can, to some extent, enhance the intra-modal semantic information of lower-quality modalities. The MMB framework therefore performs better in cross-modal retrieval tasks.

Description of drawings

Figure 1 is a schematic example of single-modal and multi-modal multi-class Boosting methods projecting multi-modal data;

Figure 2 shows the precision-recall (PR) curves on the Wiki dataset; the left panel is text-to-image retrieval and the right panel is image-to-text retrieval;

Figure 3 shows the recall curves on the Wiki dataset; the left panel is text-to-image retrieval and the right panel is image-to-text retrieval;

Figure 4 shows the PR curves on the NUS-WIDE dataset; the left panel is text-to-image retrieval and the right panel is image-to-text retrieval;

Figure 5 shows the recall curves on the NUS-WIDE dataset; the left panel is text-to-image retrieval and the right panel is image-to-text retrieval.

Detailed description

The technical solutions of the invention are described in further detail below with reference to the accompanying drawings.

Embodiment of the method for constructing a multi-modal multi-class Boosting framework for cross-modal retrieval

The method of this embodiment maps data of different modalities into a common semantic space while preserving both intra-modal semantic information and inter-modal semantic correlation. The specific steps are:

1) Construct the objective risk function $R[f_1,\dots,f_M]$, which comprises the intra-modal loss of each modality and the inter-modal losses between modalities;

2) Following a gradient descent strategy, update the predictor of each modality in the risk function in turn while holding the predictors of the other M-1 modalities fixed. One pass in which the predictors of all modalities have been updated is called a loop iteration. After T loop iterations, the optimal predictor of each modality that minimizes the objective function is learned;

3) Use the Sigmoid function to convert the quasi-margins produced by the optimal predictor of each modality into a common semantic space to realize cross-modal retrieval.

The MMB framework of this embodiment can be applied to the retrieval of many kinds of media, such as text, images, audio, and video. Letting M denote the number of modalities, the empirical risk function $R[f_1,\dots,f_M]$ is defined as

$\begin{aligned} R[f_1,\dots,f_M] &= \sum_{m=1}^{M} R_m[f_m(z^m)] + \sum_{m=1}^{M}\sum_{j>m}^{M} R_{mj}[f_m(z^m), f_j(z^j)] \\ &= \sum_{m=1}^{M}\sum_{i=1}^{N} L_m[s_i, f_m(z_i^m)] + \sum_{m=1}^{M}\sum_{j>m}^{M}\sum_{i=1}^{N} \left\| C^T [f_m(z_i^m) - f_j(z_i^j)] \right\|_2^2 \end{aligned} \qquad (24)$

where $z_i^m$ denotes the i-th data object of the m-th modality, and $L_m[\cdot]$ and $f_m(\cdot)$ denote the multi-class loss function and the predictor of the m-th modality respectively.

To minimize the risk in equation (24), we first fix all predictors $f_i$ ($i \ne 1$) and update the predictor of the first modality. We then fix the updated predictor $f_1$ together with the predictors $f_i$ ($i \ne 1,2$) that have not yet been updated, and update the predictor of the second modality. Proceeding in this way, we update the predictor of every modality. Once the predictor of the last modality has been updated, the process enters the next loop iteration, so that the final predictors of all modalities are learned jointly. Then, by using the Sigmoid function to convert the quasi-margins of each modality into a common semantic space, we can answer new cross-modal retrieval queries.
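This section does not spell out the exact Sigmoid formula, so the following is only one plausible reading, sketched under stated assumptions: the Sigmoid is applied to each of the K quasi-margins (the inner products of the predictor output with the codewords), and the scores are then normalized to sum to 1 so that the result lies on the probability simplex. The K = 3 codebook, the normalization step, and the function name are assumptions.

```python
import math

# Assumed codebook for K = 3 (unit vectors, equilateral triangle).
C = [(1.0, 0.0), (-0.5, math.sqrt(3) / 2), (-0.5, -math.sqrt(3) / 2)]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def semantic_vector(predictor_output):
    """Map a predictor output to a K-dimensional semantic feature vector."""
    margins = [dot(predictor_output, c) for c in C]          # quasi-margins
    scores = [1.0 / (1.0 + math.exp(-m)) for m in margins]   # Sigmoid
    total = sum(scores)
    return [s / total for s in scores]                       # assumed normalization


# A hypothetical output pointing toward the first codeword.
pi = semantic_vector((2.0, 0.0))
print(pi)
```

Because every modality is mapped through the same codebook, the resulting vectors are directly comparable across modalities.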

The construction of the MMB framework of this embodiment is described in detail below for two modalities, text and images.

Suppose the multi-modal dataset is $(X,Y,S)=\{(x_1,y_1,s_1),\dots,(x_N,y_N,s_N)\}$, where X and Y denote the image set and the text set, S denotes the semantic vocabulary, and N denotes the number of multi-modal documents. If the semantic vocabulary contains K different semantic classes, then each image $x_i$ and each text $y_i$ in the training set, both real-valued feature vectors, carries a semantic class $s_i \in \{1,\dots,K\}$. The images and texts in the test set, however, are not annotated with semantic classes. Given an image (text) query from the test set, the goal of cross-modal retrieval is to find semantically similar data objects in the text (image) space being searched. On this basis, we give a formal definition of cross-modal retrieval.

Definition 1: Given a query object of modality a and a dataset O of modality b to be searched, consider the semantic feature vector of the query object and that of each data item $i \in O$, and let $d(\cdot,\cdot)$ denote a distance measure. Under the condition $a,b \in \{I,T\}$ with $a \ne b$, cross-modal retrieval amounts to ranking the data objects in O so that the distance between the query's semantic feature vector and each item's semantic feature vector increases along the ranking.
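Definition 1 can be sketched as a simple sort. The definition leaves the distance measure open, so the Euclidean distance used below is an assumption, as are the function name and the example semantic vectors.

```python
import math


def rank_by_distance(query_vec, candidate_vecs):
    """Return candidate indices ordered by increasing distance to the query."""
    def dist(a, b):
        return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))
    return sorted(range(len(candidate_vecs)),
                  key=lambda i: dist(query_vec, candidate_vecs[i]))


# Hypothetical semantic vectors: a text query and three candidate images.
query = [0.8, 0.1, 0.1]
images = [[0.1, 0.8, 0.1], [0.7, 0.2, 0.1], [0.3, 0.3, 0.4]]
print(rank_by_distance(query, images))  # [1, 2, 0]
```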

If K distinct unit vectors are the vertices of a regular simplex in K-1 dimensions centered at the origin, these unit vectors form a codebook matrix $C=[c^1,\dots,c^K]$. Each semantic class k can therefore be re-encoded by the unit vector $c^k$. Let $f(x)$ and $u(y)$ denote the predictors for images and text respectively; their quasi-margins for the k-th semantic class can then be written as $\langle f(x),c^k\rangle$ and $\langle u(y),c^k\rangle$, where $\langle\cdot,\cdot\rangle$ is the standard inner product. To find the optimal predictors of the different modalities, we define the empirical risk function
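One standard way to obtain such codewords is to center and rescale the standard basis vectors. The sketch below does this in K ambient coordinates rather than K-1; the vectors still span only a (K-1)-dimensional subspace and have the same pairwise inner products, so this embedding is a convenience assumption and not necessarily the coordinates used by the authors.

```python
import math


def simplex_codebook(K):
    """Return K unit vectors forming a regular simplex centered at the origin."""
    scale = math.sqrt(K / (K - 1))
    return [[scale * ((1.0 if j == k else 0.0) - 1.0 / K) for j in range(K)]
            for k in range(K)]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


C = simplex_codebook(4)
# Each codeword has unit length; distinct codewords meet at the same angle.
print(dot(C[0], C[0]), dot(C[0], C[1]))  # approximately 1 and -1/(K-1)
```

Equal pairwise angles mean no class is favored by the encoding.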

$\begin{aligned} R[f,u] &= R_1[f(x)] + R_2[u(y)] + R_3[f(x),u(y)] \\ &= \sum_{i=1}^{N} L_1[s_i, f(x_i)] + \sum_{i=1}^{N} L_2[s_i, u(y_i)] + \sum_{i=1}^{N} \left\| C^T [f(x_i) - u(y_i)] \right\|_2^2 \end{aligned} \qquad (2)$

where $L[\cdot,\cdot]$ denotes a multi-class loss function. In equation (2), the first two terms are the intra-modal losses of images and text respectively, and the last term is the inter-modal loss between images and text. The intra-modal loss captures the semantic class information of each modality's data, while the inter-modal loss mines the semantic correlation between data of different modalities. In general, the risk function is minimized by solving the following optimization problem

$\left\{ \begin{array}{ll} \underset{f,u}{\min} & R[f(x),u(y)] \\ \text{s.t.} & f(x) \in \operatorname{span}(H),\ u(y) \in \operatorname{span}(\bar{H}) \end{array} \right. \qquad (3)$

where $H=\{g_i(x)\}$ and $\bar{H}$ denote the sets of weak learners for images and text respectively, and $\operatorname{span}(\cdot)$ denotes the functional space formed by linear combinations of multi-class weak learners.
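The constraint in equation (3) says that each predictor is a weighted sum of weak learners. A minimal sketch of such a predictor follows; the two weak learners, their 2-D outputs, and the step sizes are hypothetical.

```python
def linear_combination(steps, learners):
    """Return f(x) = sum_t steps[t] * learners[t](x), an element of span(H)."""
    def f(x):
        outputs = [g(x) for g in learners]
        dim = len(outputs[0])
        return [sum(a * out[j] for a, out in zip(steps, outputs))
                for j in range(dim)]
    return f


def g1(x):
    # Hypothetical threshold rule on a scalar feature.
    return [1.0, 0.0] if x < 0.5 else [-1.0, 0.0]


def g2(x):
    # Hypothetical constant learner.
    return [0.0, 1.0]


f = linear_combination([0.5, 2.0], [g1, g2])
print(f(0.2))  # [0.5, 2.0]
```

Boosting grows this sum one term at a time: each iteration adds one weak learner and its step size.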

This embodiment uses the multi-class exponential loss and the multi-class logistic loss. With the multi-class exponential loss, the procedure is as follows.

A multi-class loss is a non-negative function of the quasi-margins, so the multi-class exponential losses of images and text can be defined respectively as

$L_1[s, f(x)] = \sum_{k=1}^{K} \exp\left(\langle f(x), c^k - c^s \rangle\right) \qquad (4)$

$L_2[s, u(y)] = \sum_{k=1}^{K} \exp\left(\langle u(y), c^k - c^s \rangle\right) \qquad (5)$
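A small numeric check shows how the loss in equations (4) and (5) behaves. The sketch assumes K = 3 with a hand-written triangle codebook and 0-based labels; the values are illustrative only.

```python
import math

# Assumed codebook for K = 3 (unit vectors, equilateral triangle).
C = [(1.0, 0.0), (-0.5, math.sqrt(3) / 2), (-0.5, -math.sqrt(3) / 2)]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def exp_loss(s, f):
    """Multi-class exponential loss for label s and predictor output f."""
    return sum(math.exp(dot(f, C[k]) - dot(f, C[s])) for k in range(len(C)))


zero = exp_loss(0, (0.0, 0.0))   # uninformative output: every term is exp(0)
right = exp_loss(0, C[0])        # output aligned with the true codeword
wrong = exp_loss(1, C[0])        # same output, but the true class is another one
print(zero, right, wrong)
```

The loss falls below K when the quasi-margin of the true class dominates and grows exponentially when a wrong class dominates.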

上述两个公式中的多类指数损耗在Boosting多分类任务中具有贝叶斯一致性、多类边缘极大化与猜测背离性的优点。The multi-class exponential loss in the above two formulas has the advantages of Bayesian consistency, multi-class edge maximization and guessing divergence in Boosting multi-classification tasks.

For the optimization problem in formula (3), it is difficult to solve for the optimal image predictor and text predictor simultaneously. The problem becomes easy, however, if only one predictor is adjusted at a time. Let f^t(x) and u^t(y) denote the image and text predictors after the t-th Boosting iteration. Without loss of generality, we first fix the text predictor and update the image predictor. In the neighborhood of the predictor f^t(x), the first-order functional derivative of the objective function R[f(x), u^t(y)] along the direction of a multi-class weak learner g(x) can be expressed as

$\begin{matrix} \delta R[f^{t}; g] = \left. \frac{\partial R[f^{t} + \xi g,\, u^{t}]}{\partial \xi} \right|_{\xi = 0} \\ = -\sum_{i=1}^{N} \langle g(x_{i}),\, P_{i} \rangle \end{matrix} \qquad (6)$

where

$P_{i} = \sum_{k=1}^{K} (c^{s_{i}} - c^{k}) \exp(\langle f^{t}(x_{i}),\, c^{k} - c^{s_{i}} \rangle) - 2 C C^{T} (f^{t}(x_{i}) - u^{t}(y_{i})) \qquad (7)$
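The weight P_i of formula (7) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: vectors are plain lists, the class index is 0-based, and both the codebook and the matrix C are taken to be the identity in the example, which the text does not prescribe. The function name `weight_P` is hypothetical.

```python
import math

def dot(a, b):
    """Inner product <a, b> of two equal-length vectors."""
    return sum(p * q for p, q in zip(a, b))

def weight_P(s, f_x, u_y, codebook, C):
    """Formula (7) for one document: intra-modal term minus 2*C*C^T*(f(x) - u(y))."""
    d = len(f_x)
    c_s = codebook[s]
    p = [0.0] * d
    # Intra-modal part: sum_k (c^s - c^k) * exp(<f(x), c^k - c^s>)
    for c_k in codebook:
        diff = [a - b for a, b in zip(c_k, c_s)]          # c^k - c^s
        w = math.exp(dot(f_x, diff))
        for j in range(d):
            p[j] -= diff[j] * w
    # Inter-modal part: -2 * C * C^T * (f(x) - u(y))
    r = [a - b for a, b in zip(f_x, u_y)]
    n_cols = len(C[0])
    ct_r = [sum(C[j][m] * r[j] for j in range(d)) for m in range(n_cols)]   # C^T r
    for j in range(d):
        p[j] -= 2.0 * sum(C[j][m] * ct_r[m] for m in range(n_cols))         # C (C^T r)
    return p

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# The image predictor is still at zero; the paired text predictor already points at class 0.
P = weight_P(0, [0.0, 0.0, 0.0], [1.0, 0.0, 0.0], codebook=identity, C=identity)
```

In this toy case P is [4, -1, -1]: the intra-modal term contributes [2, -1, -1], pushing the image predictor toward its true class, and the inter-modal term adds [2, 0, 0], pulling it toward the paired text predictor.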

In the (t+1)-th image iteration, following the gradient descent strategy, the multi-class weak learner g^*(x) that reduces the risk the most can be expressed as

$\begin{matrix} g^{*}(x) = \arg \underset{g \in H}{\min}\ \delta R[f^{t}; g] \\ = \arg \underset{g \in H}{\max} \sum_{i=1}^{N} \langle g(x_{i}),\, P_{i} \rangle \end{matrix} \qquad (8)$

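The optimal step size along this direction is the one that minimizes the risk:

$\alpha^{*} = \arg \underset{\alpha \in \mathbb{R}}{\min}\ R[f^{t} + \alpha g^{*},\, u^{t}] \qquad (9)$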

Therefore, the image predictor is updated to

$f^{t+1}(x) = f^{t}(x) + \alpha^{*} g^{*}(x) \qquad (10)$
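The image half-step (choose a direction, choose a step size, update the predictor) can be sketched as follows. The text does not say how the step size is computed numerically, so the golden-section search, its interval [0, 4], the constant toy weak learners, and the quadratic stand-in for the risk along the chosen direction are all assumptions made for illustration.

```python
import math

def dot(a, b):
    """Inner product <a, b> of two equal-length vectors."""
    return sum(p * q for p, q in zip(a, b))

def select_direction(candidates, xs, weights):
    """Pick the weak learner g with the largest sum_i <g(x_i), P_i>."""
    return max(candidates,
               key=lambda g: sum(dot(g(x), p) for x, p in zip(xs, weights)))

def golden_section(phi, lo=0.0, hi=4.0, iters=60):
    """Minimise a unimodal 1-D function phi on [lo, hi] (assumed search interval)."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    for _ in range(iters):
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if phi(c) < phi(d):
            b = d          # the minimum lies in [a, d]
        else:
            a = c          # the minimum lies in [c, b]
    return (a + b) / 2.0

def boosted(f, g, alpha):
    """Updated predictor f + alpha * g, returned as a new function."""
    return lambda x: [fv + alpha * gv for fv, gv in zip(f(x), g(x))]

# Toy data: two documents and two hypothetical constant weak learners.
xs = [[0.0], [1.0]]
weights = [[1.0, -1.0], [2.0, -2.0]]                     # stand-ins for P_1 and P_2
g_a = lambda x: [1.0, -1.0]
g_b = lambda x: [-1.0, 1.0]
g_star = select_direction([g_a, g_b], xs, weights)
alpha_star = golden_section(lambda a: (a - 1.5) ** 2)    # stand-in for the risk along g*
f_next = boosted(lambda x: [0.0, 0.0], g_star, alpha_star)
```

Here `g_a` is selected because it is aligned with the weights, the line search returns a step close to 1.5, and `f_next` maps any input to approximately [1.5, -1.5]. In practice the candidate set would be produced by training weak learners rather than enumerated by hand.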

Next, we fix the image predictor just obtained and update the text predictor. In the neighborhood of the predictor u^t(y), the first-order functional derivative of the objective function R[f^{t+1}(x), u(y)] along the direction of a multi-class weak learner v(y) can be expressed as

$\begin{matrix} \delta R[u^{t}; v] = \left. \frac{\partial R[f^{t+1},\, u^{t} + \epsilon v]}{\partial \epsilon} \right|_{\epsilon = 0} \\ = -\sum_{i=1}^{N} \langle v(y_{i}),\, Q_{i} \rangle \end{matrix} \qquad (11)$

where

$Q_{i} = \sum_{k=1}^{K} (c^{s_{i}} - c^{k}) \exp(\langle u^{t}(y_{i}),\, c^{k} - c^{s_{i}} \rangle) + 2 C C^{T} (f^{t+1}(x_{i}) - u^{t}(y_{i})) \qquad (12)$

In the (t+1)-th text iteration, following the gradient descent strategy, the multi-class weak learner v^*(y) that reduces the risk the most can be expressed as

$\begin{matrix} v^{*}(y) = \arg \underset{v \in \bar{H}}{\min}\ \delta R[u^{t}; v] \\ = \arg \underset{v \in \bar{H}}{\max} \sum_{i=1}^{N} \langle v(y_{i}),\, Q_{i} \rangle \end{matrix} \qquad (13)$

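Slightly different from formula (9), the optimal step size along the direction v^*(y) is computed with the updated image predictor held fixed:

$\alpha^{*} = \arg \underset{\alpha \in \mathbb{R}}{\min}\ R[f^{t+1},\, u^{t} + \alpha v^{*}] \qquad (14)$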

Therefore, the text predictor is updated to

$u^{t+1}(y) = u^{t}(y) + \alpha^{*} v^{*}(y) \qquad (15)$

By alternating these updates, we can find the optimal image and text predictors. The multi-modal multi-class Boosting framework based on the multi-class exponential loss is denoted E_MMB.

Algorithm 1 in Table 1 gives the detailed steps of E_MMB. In each iteration, E_MMB does not learn the image and text predictors simultaneously; it adjusts one predictor while the other is fixed. For example, in lines 3 to 6 of Algorithm 1 the text predictor is fixed and the image predictor is adjusted. The adjustment of the image predictor is then an extended single-modal multi-class Boosting step, and the second term of formula (2) can be treated as a constant. In line 3, following the gradient descent strategy, we obtain the weight of each multimodal document. This weight reflects the semantic information of the image and also encodes the tendency of the image to move toward the text. In line 4, using the weights of all multimodal documents, we find the direction in the functional space that reduces the risk the most. Line 5 computes the optimal step size along this direction, and line 6 updates the image predictor. Similarly, in lines 7 to 10 of Algorithm 1, the image predictor just updated is fixed and the text predictor is adjusted.
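The alternating schedule described above can be illustrated on a deliberately tiny stand-in problem. The sketch below is not Algorithm 1 itself: each predictor is reduced to a single number, the intra-modal losses are the hypothetical quadratics (f - 1)^2 and (u + 1)^2, the inter-modal loss is (f - u)^2, and each half-step is an exact one-dimensional minimization instead of a weak-learner search. It only shows how fixing one predictor while updating the other drives the joint risk down.

```python
def risk(f, u):
    """Toy risk: two intra-modal terms plus one inter-modal coupling term."""
    return (f - 1.0) ** 2 + (u + 1.0) ** 2 + (f - u) ** 2

def alternate(loops):
    """Alternately minimise the toy risk over f (u fixed) and over u (f fixed)."""
    f, u = 0.0, 0.0
    history = [risk(f, u)]
    for _ in range(loops):
        f = (1.0 + u) / 2.0      # solves 2(f - 1) + 2(f - u) = 0 for f with u fixed
        u = (f - 1.0) / 2.0      # solves 2(u + 1) - 2(f - u) = 0 for u with the new f fixed
        history.append(risk(f, u))
    return f, u, history

f_opt, u_opt, history = alternate(loops=30)
```

The iterates converge to f = 1/3 and u = -1/3. Without the coupling term they would sit at 1 and -1, so the inter-modal term visibly pulls the two "predictors" toward each other, and the recorded risk never increases from one loop to the next.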

Overall, Algorithm 1 consists of one ensemble iteration loop, and each pass of the loop improves the image and text predictors. In each pass, computing the multi-class weak learners of image and text dominates the running time. The low-level feature dimensions of image and text generally differ, so the cost of computing a multi-class weak learner differs between the two modalities. If the costs of computing the image and text multi-class weak learners are O(μ) and O(τ), respectively, then over λ loop iterations the time complexity of E_MMB is approximately O(μλ+τλ).

Table 1. Multi-modal multi-class Boosting (E_MMB) algorithm

The procedure with the multi-class logistic loss is as follows. The multi-class logistic losses of image and text can be defined, respectively, as

$L_{1}[s,\, f(x)] = \sum_{k=1}^{K} \log[1 + \exp(\langle f(x),\, c^{k} - c^{s} \rangle)] \qquad (16)$

$L_{2}[s,\, u(y)] = \sum_{k=1}^{K} \log[1 + \exp(\langle u(y),\, c^{k} - c^{s} \rangle)] \qquad (17)$

It is known that, in Boosting multi-class classification tasks, the multi-class logistic losses in formulas (16) and (17) also have the advantages of Bayes consistency, multi-class margin maximization, and guess aversion.
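A minimal Python sketch of formula (16) follows, again assuming a one-hot codebook and a 0-based class index purely for illustration; `logistic_loss` is a hypothetical name.

```python
import math

def dot(a, b):
    """Inner product <a, b> of two equal-length vectors."""
    return sum(p * q for p, q in zip(a, b))

def logistic_loss(s, f_x, codebook):
    """Formula (16): sum over k of log(1 + exp(<f(x), c^k - c^s>))."""
    c_s = codebook[s]
    return sum(
        math.log1p(math.exp(dot(f_x, [a - b for a, b in zip(c_k, c_s)])))
        for c_k in codebook
    )

# Hypothetical one-hot codebook for K = 3 semantic classes.
codebook = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
f_x = [5.0, 0.0, 0.0]                            # strongly favours class 0
loss_wrong = logistic_loss(1, f_x, codebook)     # true class is 1, so the prediction is wrong
```

For this misclassified sample the logistic loss is log(1 + e^5) + 2 log 2, roughly 6.4, whereas the exponential loss of formula (4) on the same input would be e^5 + 2, roughly 150. The logistic loss grows about linearly in the margin violation, which is the practical difference between the two choices.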

As before, we first update the image predictor. In the neighborhood of the predictor f^t(x), the first-order functional derivative of the objective function R[f(x), u^t(y)] along the direction of a multi-class weak learner g(x) is

$\delta R[f^{t}; g] = -\sum_{i=1}^{N} \langle g(x_{i}),\, PP_{i} \rangle \qquad (18)$

where

$PP_{i} = \sum_{k=1}^{K} (c^{s_{i}} - c^{k}) \frac{\exp(\langle f^{t}(x_{i}),\, c^{k} - c^{s_{i}} \rangle)}{1 + \exp(\langle f^{t}(x_{i}),\, c^{k} - c^{s_{i}} \rangle)} - 2 C C^{T} (f^{t}(x_{i}) - u^{t}(y_{i})) \qquad (19)$
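The only change from formula (7) to formula (19) is the scalar factor that multiplies each codebook difference. The short sketch below compares the two factors for a few hypothetical margin values; the function names are illustrative.

```python
import math

def exp_factor(margin):
    """Per-class factor in P_i, formula (7): exp(margin)."""
    return math.exp(margin)

def logistic_factor(margin):
    """Per-class factor in PP_i, formula (19): exp(margin) / (1 + exp(margin))."""
    # Written as 1 / (1 + exp(-margin)), which is the same ratio and
    # does not overflow for large positive margins.
    return 1.0 / (1.0 + math.exp(-margin))

margins = [-4.0, 0.0, 4.0, 8.0]      # example values of <f(x_i), c^k - c^{s_i}>
table = [(m, exp_factor(m), logistic_factor(m)) for m in margins]
```

The logistic factor always stays between 0 and 1, while the exponential factor grows without bound, so under the logistic loss a badly misclassified document cannot dominate the weights.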

In the (t+1)-th image iteration, the optimal image weak learner and its step size are obtained from formulas (19), (8), and (9), with PP_i taking the place of P_i.

Next, we update the text predictor. In the neighborhood of the predictor u^t(y), the first-order functional derivative of the objective function R[f^{t+1}(x), u(y)] along the direction of a multi-class weak learner v(y) is

$\delta R[u^{t}; v] = -\sum_{i=1}^{N} \langle v(y_{i}),\, QQ_{i} \rangle \qquad (20)$

where

$QQ_{i} = \sum_{k=1}^{K} (c^{s_{i}} - c^{k}) \frac{\exp(\langle u^{t}(y_{i}),\, c^{k} - c^{s_{i}} \rangle)}{1 + \exp(\langle u^{t}(y_{i}),\, c^{k} - c^{s_{i}} \rangle)} + 2 C C^{T} (f^{t+1}(x_{i}) - u^{t}(y_{i})) \qquad (21)$

In the (t+1)-th text iteration, the optimal multi-class text weak learner and its step size are obtained from formulas (20), (13), and (14). As the number of iterations increases, the optimal image and text predictors are found alternately. For convenience in what follows, the multi-modal multi-class Boosting framework based on the multi-class logistic loss is denoted L_MMB.

In addition, replacing the expressions P_i and Q_i in lines 3 and 7 of Table 1 with PP_i and QQ_i, respectively, directly yields the L_MMB algorithm. L_MMB has the same order of time complexity as E_MMB.

Moreover, when the multi-class weak learners of image or text are ensembled separately, the time complexity of single-modal multi-class Boosting is O(μλ) or O(τλ), which means that it has the same order of time complexity as the MMB algorithm of this embodiment.

In single-modal Boosting, if a single-modal data object has low-quality low-level content features, its intra-modal semantic information cannot be mined well. Single-modal Boosting also ignores the semantic information between modalities. When projecting low-level features to high-level semantic features, an effective mapping mechanism should combine all modalities so that both intra-modal and inter-modal semantic information is preserved. In multi-modal Boosting, minimizing the inter-modal loss shortens the distance between the quasi-margins of a pair of multimodal data objects, so that semantically related objects cluster together in the semantic space. To mine the internal semantic information of each modality, the intra-modal loss is also minimized. In addition, when the intra-modal semantic information comes from lower-quality data objects, it can be enhanced to some extent by the corresponding inter-modal semantic correlation.

In step 3), for any image x, the image predictor f^λ(x) is used to compute the posterior probability of the image with respect to the k-th semantic class:

$\pi_{k}^{I} = P(s = k \mid x) = \frac{\sigma(\langle f^{\lambda}(x),\, c^{k} \rangle)}{\sum_{k} \sigma(\langle f^{\lambda}(x),\, c^{k} \rangle)} \qquad (22)$

where σ(·) is the Sigmoid function. Similarly, the posterior probability of a text y with respect to the k-th semantic class is

$\pi_{k}^{T} = P(s = k \mid y) = \frac{\sigma(\langle u^{\lambda}(y),\, c^{k} \rangle)}{\sum_{k} \sigma(\langle u^{\lambda}(y),\, c^{k} \rangle)} \qquad (23)$

Given a query q of one modality and the objects to be retrieved from another modality, the mapping mechanism of formulas (22) and (23) yields the semantic feature vectors of these data. Within the semantic space, conventional distance measures can then be used to perform cross-modal retrieval.
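The mapping of formulas (22) and (23) and a similarity-based ranking can be sketched as follows. The one-hot codebook, the example predictor outputs, and the specific similarity (an inner product divided by the two vector norms, used here as a stand-in for the normalized correlation mentioned in the experiments) are assumptions for illustration; all function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def posterior(pred, codebook):
    """Formulas (22)/(23): Sigmoid of each quasi-margin <pred, c^k>, normalised to sum to 1."""
    scores = [sigmoid(sum(p * c for p, c in zip(pred, c_k))) for c_k in codebook]
    total = sum(scores)
    return [s / total for s in scores]

def normalized_correlation(p, q):
    """Assumed similarity: <p, q> / (||p|| * ||q||)."""
    num = sum(a * b for a, b in zip(p, q))
    den = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return num / den

def retrieve(query_vec, candidate_vecs):
    """Indices of the candidates, sorted from most to least similar to the query."""
    return sorted(range(len(candidate_vecs)),
                  key=lambda i: normalized_correlation(query_vec, candidate_vecs[i]),
                  reverse=True)

codebook = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
text_query = posterior([3.0, 0.0, -1.0], codebook)        # stand-in for u^lambda(y) of a query text
images = [posterior(v, codebook) for v in ([-1.0, 2.0, 0.0],   # stand-ins for f^lambda(x)
                                           [2.5, 0.5, -0.5],
                                           [0.0, 0.0, 3.0])]
ranking = retrieve(text_query, images)
```

In this toy run the second image (index 1), whose predictor output also favours the first semantic class, is ranked first for the text query. Both modalities end up as probability vectors of the same length K, which is what makes a text query comparable with image candidates.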

Embodiment of the device of the present invention for constructing a multi-modal multi-class Boosting framework for cross-modal retrieval

The device of this embodiment implements the above method for constructing a multi-modal multi-class Boosting framework for cross-modal retrieval. The device includes an objective function construction module, an optimal predictor learning module, and a semantic space conversion module.

The optimal predictor learning module updates, in turn and following the gradient descent strategy, the predictor of each modality in the target risk function while fixing the predictors of the other M-1 modalities. One loop iteration is complete when the predictors of all modalities have been updated. After T loop iterations, where T≥1, the module has learned the optimal predictor of each modality that minimizes the objective function.

To further demonstrate the performance of the cross-modal retrieval method based on the MMB framework of this embodiment, the prior-art SM, SCM, and LCMH algorithms are compared with the MMB algorithm of this application. The mapping mechanism of the first two methods is replaced with single-modal multi-class Boosting, while the third method serves as an example that does not use semantic feature vectors. Hardware environment for all experiments: a desktop computer with a 2.93 GHz dual-core CPU (E7500) and 2 GB of memory. Software environment: Windows XP, with MATLAB (R2012b) as the development tool.

Two retrieval tasks were carried out in the experiments: the first uses a text query to retrieve relevant images, and the second uses an image query to retrieve relevant texts.

We use two benchmark datasets, the Wiki dataset and the NUS-WIDE dataset. The Wiki dataset is a multimodal document collection of 2866 image-text pairs, and the semantic class of each document is one of the 10 most common semantic concepts. The training and test sets contain 2173 and 693 multimodal documents, respectively. The low-level features of images and texts are a 128-dimensional bag-of-visual-words vector and a probability distribution over 10 topics, respectively. The NUS-WIDE dataset is built from images and tags on Flickr; it originally contains 269648 images and a vocabulary of 81 ground-truth semantic concepts. Flickr provides meaningful tags for all images, so each image and its tags form a multimodal document, that is, an image-text pair. From the 15 most frequent semantic concepts in the dataset, we randomly sample 4800 multimodal documents. Each multimodal document belongs to exactly one semantic concept class, such as "Animal", "Buildings", or "Flowers". Each semantic concept class covers 320 multimodal documents, and the final training and test sets contain 3750 and 1050 multimodal documents, respectively. In our experiments, the low-level features of images and texts are a 500-dimensional SIFT feature vector and a 1000-dimensional word-frequency vector, respectively.

For consistency, normalized correlation (NC) is used as the measure in the SM and SCM methods, and Hamming distance (HD) in the LCMH method. In the MMB framework, NC measures the similarity between the query object and each retrieved object. The multi-class weak learners of both image and text are decision trees of depth 2, and the number of loop iterations of Algorithm 1 is set to 100. A query object and a returned target object are relevant if they belong to the same semantic class. In information retrieval, precision and recall are the basic measures of retrieval performance. Given a query object and a set of objects to be retrieved, if the retrieval algorithm outputs a ranked sequence O_W of size W according to the similarity measure, then precision is the fraction of the W objects in O_W that are relevant to the query, and recall is the fraction of all relevant objects in the retrieved set that appear in O_W.

Based on the precision of formula (25), the average precision (AP) is defined as

$AP = \frac{1}{E} \sum_{i=1}^{W} \mathrm{precision}(i) \cdot \gamma(i) \qquad (27)$

where E is the number of relevant objects in the sequence O_W. In formula (27), γ(i) is 1 if the i-th object of O_W is relevant to the query and 0 otherwise. Averaging the AP values of all queries gives the MAP (Mean Average Precision) score. MAP is a standard information retrieval metric, and a larger MAP indicates better retrieval performance. For all cross-modal retrieval methods, the evaluation measures used in this embodiment are the MAP value, the PR curve (11-point interpolated precision-recall curve), and the recall curve.
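Formula (27) and the MAP score can be sketched as below. The sketch assumes the usual reading of precision(i) as the fraction of relevant objects among the first i returned objects, and it returns 0 for a query with no relevant object in the list; both are conventions chosen for this example, and the function names are hypothetical.

```python
def average_precision(relevance):
    """Formula (27). relevance[i-1] plays the role of gamma(i): 1 if relevant, else 0."""
    hits = 0
    total = 0.0
    for i, gamma in enumerate(relevance, start=1):
        if gamma:
            hits += 1
            total += hits / i          # precision(i): relevant objects in the top i, divided by i
    # hits equals E after the loop; AP is taken as 0 when E = 0 (assumed convention)
    return total / hits if hits else 0.0

def mean_average_precision(relevance_lists):
    """MAP: the mean of the AP values over all queries."""
    return sum(average_precision(r) for r in relevance_lists) / len(relevance_lists)

ap = average_precision([1, 0, 1, 0])                              # (1/1 + 2/3) / 2
map_score = mean_average_precision([[1, 0, 1, 0], [0, 1, 0, 0]])  # (5/6 + 1/2) / 2
```

For the first query the relevant objects sit at ranks 1 and 3, giving AP = 5/6; for the second the single relevant object is at rank 2, giving AP = 1/2; the MAP over both queries is 2/3.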

On the Wiki dataset, we compare MMB with the three other cross-modal retrieval methods. Table 2 lists the MAP values of all methods, where W is the number of returned documents and bold values mark the best retrieval performance. The table shows that E_MMB and L_MMB outperform the other three methods on both cross-modal retrieval tasks and achieve better average retrieval performance. For example, at W=50 the average MAP of L_MMB is 0.31, about 20.2% higher than the average MAP of SCM; when all objects in the test set are returned, the average MAP of L_MMB reaches 0.23, about 19.8% higher than that of SCM.

Table 2. Performance comparison (MAP) of cross-modal retrieval methods on the Wiki dataset

Single-modal Boosting does not consider the semantic information between modalities, which may leave an image-text pair far apart in the semantic space; SM and SCM therefore produce smaller MAP values. On the one hand, intra-modal semantic information reflects the semantic expressiveness of each modality, while the inter-modal semantic information obtained by minimizing the inter-modal loss focuses on the correlation between modalities. Both types of semantic information play an important role in cross-modal retrieval and are complementary, so combining them improves retrieval performance. On the other hand, by minimizing the intra-modal loss, modal data with high-quality low-level features obtain good intra-modal semantic information; at the same time, the inter-modal semantic correlation can to some extent enhance intra-modal semantic information of poorer quality. These reasons explain the better performance of the MMB framework in cross-modal retrieval. In addition, LCMH has smaller MAP values than SM and SCM on both cross-modal retrieval tasks. The reason is that SM and SCM obtain high-level semantic features of images and texts through single-modal Boosting, whereas LCMH does not consider the semantic information in the semantic probability space.

For a more detailed analysis, Figure 2 plots the PR curves of LCMH, SM, SCM, E_MMB, and L_MMB on the Wiki dataset. The figure shows that the MMB framework again outperforms the other three methods on both cross-modal retrieval tasks. For example, the precision of MMB improves substantially on the image query task, and the gains appear at all recall levels. The results in Figure 2 indicate that the MMB framework has higher accuracy and better generalization ability. The recall curve shows how recall changes as the number of retrieved objects in the returned sequence increases. Figure 3 gives the recall curves of LCMH, SM, SCM, E_MMB, and L_MMB on the Wiki dataset. The recall curves of the MMB framework lie above those of the other cross-modal retrieval methods throughout. This result shows that, for the same number of examined objects, the MMB framework achieves better recall; in other words, it places more relevant objects near the front of the returned sequence. Overall, by combining intra-modal semantic information with inter-modal semantic correlation, the MMB framework effectively improves cross-modal retrieval performance.

We also compare the MMB framework with the three other cross-modal retrieval methods on the NUS-WIDE dataset. Table 3 gives the MAP values of all methods, where W is the number of examined objects and bold values mark the highest performance. The MMB framework again outperforms the other methods and achieves the best average retrieval performance. For example, at W=50 the average MAP of L_MMB is about 0.24, about 39.3% higher than the average MAP of SM; when all objects in the test set are returned, the average MAP of L_MMB reaches 0.17, about 40.7% higher than that of SM.

Table 3. Performance comparison (MAP) of cross-modal retrieval methods on the NUS-WIDE dataset

Similarly, Figures 4 and 5 plot the PR curves and recall curves of LCMH, SM, SCM, E_MMB, and L_MMB on the NUS-WIDE dataset. The MMB framework again shows the best cross-modal retrieval performance. Moreover, its PR and recall curves behave consistently across the NUS-WIDE and Wiki datasets. For example, at all recall levels in Figures 2 and 4, the PR curves of E_MMB and L_MMB on the image query task lie above those of the other methods by a large margin. The main reason for the results on the NUS-WIDE dataset is that the MMB framework combines intra-modal semantic information with inter-modal semantic correlation.

Claims

1. A method for constructing a multi-modal multi-class Boosting framework for cross-modal retrieval, characterized in that the method comprises the following steps:

1) constructing a target risk function R[f_1, ..., f_M], the target risk function comprising the intra-modal loss of each modality and the inter-modal losses between modalities, where f_1 is the predictor of the first modality, f_M is the predictor of the M-th modality, and M≥2;

2) according to a gradient descent strategy, updating in turn the predictor of each modality in the target risk function while fixing the predictors of the other M-1 modalities, one loop iteration being complete when the predictors of all modalities have been updated, so that after T loop iterations the optimal predictor of each modality that minimizes the target risk function is learned, where T≥1;

3) converting the quasi-margins produced by the optimal predictor of each modality into a common semantic space, thereby enabling cross-modal retrieval.

2. The method for constructing a multi-modal multi-class Boosting framework for cross-modal retrieval according to claim 1, characterized in that one loop iteration in step 2) proceeds as follows:

A) calculating the weight of each multimodal document according to the gradient descent strategy;

B) according to the weights of the documents, obtaining the first-order functional derivative of the objective function along the direction of a multi-class learner in the neighborhood of the predictor being updated, and then finding in the functional space the multi-class learner that reduces the risk the most, that is, finding the optimal direction in the functional space;

C) using the multi-class learner obtained in step B), obtaining the optimal step size along the optimal direction, and updating the predictor according to the optimal step size.

3. The method for constructing a multi-modal multi-class Boosting framework for cross-modal retrieval according to claim 2, characterized in that the weights in step A) are calculated on the basis of the multi-class exponential loss of each modality, the multi-class exponential loss of a modality being defined as: $L[s,\, f(x)] = \sum_{k=1}^{K} \exp(\langle f(x),\, c^{k} - c^{s} \rangle)$, where f(x) is the predictor of the modality, K is the number of semantic classes in the semantic vocabulary, <f(x), c^k - c^s> is the difference between the quasi-margins of the predictor with respect to the k-th and s-th semantic classes, and c^k and c^s are the codebook vectors of the k-th and s-th semantic classes, respectively.

4. The method for constructing a multi-modal multi-class Boosting framework for cross-modal retrieval according to claim 2, characterized in that the weights in step A) are calculated on the basis of the multi-class logistic loss of each modality, the multi-class logistic loss of a modality being defined as: $L[s,\, f(x)] = \sum_{k=1}^{K} \log[1 + \exp(\langle f(x),\, c^{k} - c^{s} \rangle)]$, where f(x) is the predictor of the modality, K is the number of semantic classes in the semantic vocabulary, <f(x), c^k - c^s> is the difference between the quasi-margins of the predictor with respect to the k-th and s-th semantic classes, and c^k and c^s are the codebook vectors of the k-th and s-th semantic classes, respectively.

5. The method for constructing a multi-modal multi-class Boosting framework for cross-modal retrieval according to claim 1, characterized in that the risk function R[f_1, ..., f_M] is expressed as:

$\begin{matrix} R[f_{1}, \ldots, f_{M}] = \sum_{m=1}^{M} R_{m}[f_{m}(z^{m})] + \sum_{m=1}^{M} \sum_{j>m}^{M} R_{mj}[f_{m}(z^{m}),\, f_{j}(z^{j})] \\ = \sum_{m=1}^{M} \sum_{i=1}^{N} L_{m}[s_{i},\, f_{m}(z_{i}^{m})] + \sum_{m=1}^{M} \sum_{j>m}^{M} \sum_{i=1}^{N} \| C^{T} [f_{m}(z_{i}^{m}) - f_{j}(z_{i}^{j})] \|_{2}^{2} \end{matrix}$

where $z_{i}^{m}$ denotes the i-th data object of the m-th modality, L_m[·] and f_m(·) denote the multi-class loss function and the predictor of the m-th modality, respectively, and $\| C^{T} [f_{m}(z_{i}^{m}) - f_{j}(z_{i}^{j})] \|_{2}^{2}$ denotes the inter-modal loss between the m-th and j-th modalities with respect to the i-th data object.

6. A device for constructing a multi-modal multi-class Boosting framework for cross-modal retrieval, characterized in that the device comprises an objective function construction module, an optimal predictor learning module, and a semantic space conversion module;

the objective function construction module is configured to construct a target risk function R[f_1, ..., f_M], the target risk function comprising the intra-modal loss of each modality and the inter-modal losses between modalities, where f_1 is the predictor of the first modality, f_M is the predictor of the M-th modality, and M≥2;

the optimal predictor learning module is configured to update in turn, according to a gradient descent strategy, the predictor of each modality in the target risk function while fixing the predictors of the other M-1 modalities, one loop iteration being complete when the predictors of all modalities have been updated, so that after T loop iterations the optimal predictor of each modality that minimizes the objective function is learned, where T≥1;

the semantic space conversion module is configured to convert the quasi-margins produced by the optimal predictor of each modality into a common semantic space, thereby enabling cross-modal retrieval.

7. The multimodal multiclass Boosting framework construction device for cross-modal retrieval according to claim 6, characterized in that the process of one loop iteration is:

8. The multimodal multiclass Boosting framework construction device for cross-modal retrieval according to claim 7, characterized in that the weight of each modality is calculated based on the multiclass exponential loss of that modality, and the multiclass exponential loss of a modality is defined as: where f(x) is the predictor of the modality, K is the number of semantic classes in the semantic vocabulary, <f(x), c^k - c^s> is the quasi-margin difference of the predictor with respect to the k-th and s-th semantic classes, and c^k and c^s denote the codebook vectors corresponding to the k-th and s-th semantic classes, respectively.

9. The multimodal multiclass Boosting framework construction device for cross-modal retrieval according to claim 7, characterized in that the weight of each modality is calculated based on the multiclass logistic loss of that modality, and the multiclass logistic loss of a modality is defined as: where f(x) is the predictor of the modality, K is the number of semantic classes in the semantic vocabulary, <f(x), c^k - c^s> is the quasi-margin difference of the predictor with respect to the k-th and s-th semantic classes, and c^k and c^s denote the codebook vectors corresponding to the k-th and s-th semantic classes, respectively.
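Both the exponential and the logistic loss are described in terms of the quasi-margin differences <f(x), c^k - c^s>. The sketch below computes only that quantity for all class pairs; the loss formulas themselves are not reproduced in this text and are not reconstructed here. The function name `quasi_margin_differences` and the codebook layout (d x K, column k is c^k) are assumptions made for illustration.

```python
import numpy as np

def quasi_margin_differences(fx, codebook):
    """Return D of shape (K, K) with D[k, s] = <f(x), c^k - c^s>.

    fx:       predictor output f(x), shape (d,).
    codebook: shape (d, K); column k is the codebook vector c^k (assumed layout).
    """
    q = codebook.T @ fx                 # q[k] = <f(x), c^k>, the quasi-margin for class k
    return q[:, None] - q[None, :]      # broadcasting gives all pairwise differences
```

A large positive D[k, s] means the predictor output is better aligned with the codebook vector of class k than with that of class s.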

10. The multimodal multiclass Boosting framework construction device for cross-modal retrieval according to claim 6, characterized in that the target risk function R[f_1, ..., f_M] is expressed as:

\begin{matrix} R[f_{1}, \ldots, f_{M}] = \sum_{m=1}^{M} R_{m}[f_{m}(z^{m})] + \sum_{m=1}^{M} \sum_{j>m}^{M} R_{mj}[f_{m}(z^{m}), f_{j}(z^{j})] \\ = \sum_{m=1}^{M} \sum_{i=1}^{N} L_{m}[s_{i}, f_{m}(z_{i}^{m})] + \sum_{m=1}^{M} \sum_{j>m}^{M} \sum_{i=1}^{N} \| C^{T} [f_{m}(z_{i}^{m}) - f_{j}(z_{i}^{j})] \|_{2}^{2} \end{matrix}