WO2023020214A1 - Retrieval model training and retrieval method, apparatus, device, and medium - Google Patents

Retrieval model training and retrieval method, apparatus, device, and medium

Info

Publication number
WO2023020214A1
Authority
WO
WIPO (PCT)
Prior art keywords
sample
network
triplet
training
quantization
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2022/107973
Other languages
English (en)
French (fr)
Inventor
郭卉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to EP22857527.0A (patent EP4386579B1)
Publication of WO2023020214A1
Priority to US18/134,447 (patent US12493649B2)
Anticipated expiration
Ceased

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/53 Querying
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/51 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60 Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/68 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/683 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Definitions

  • The present application relates to the field of artificial intelligence, and in particular to a method, apparatus, device, and medium for training a retrieval model and for retrieval.
  • In image retrieval, the feature vector of the query image is usually obtained through an embedding vector network, and the feature vector is quantized by PQ (Product Quantization) to obtain a quantization index. Then m feature vectors matching the quantization index are looked up in the quantization codebook, and the m candidate images corresponding to those feature vectors are recalled. The candidates are sorted by the distance between their feature vectors and the feature vector of the query image, and the top-ranked candidates are selected as the final recalled images.
  • PQ quantization divides the value range of each dimension of the feature vector into multiple segments, and each segment is represented by a different number. For example, a floating-point value between 0 and 1 may be divided into 10 segments with upper bounds 0.1, ..., 0.9, 1.0, and the segments are numbered 1 to 10. Candidate images quantized to the same segment are recalled during retrieval.
  • However, PQ quantization tends to split similar features into two adjacent segments, so critical samples near a segment boundary are prone to missed recall or false recall. For example, a critical sample on the boundary between segment 1 and segment 2 may be similar to samples in either segment; recalling only one segment misses some similar samples, while recalling both segments increases false recalls.
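The boundary problem described above can be illustrated with a small sketch. It assumes the 10-segment example from the text (values in 0 to 1, segments numbered 1 to 10); the function name `segment_index` and the sample values are hypothetical and chosen only for illustration.

```python
import math

def segment_index(value: float, num_segments: int = 10) -> int:
    """Map a value in (0, 1] to a segment number in 1..num_segments."""
    # Segment 1 covers roughly (0, 0.1], segment 2 roughly (0.1, 0.2], and so on
    # (exact boundary values are subject to floating-point rounding).
    raw = math.ceil(value * num_segments)
    return min(num_segments, max(1, raw))

# Hypothetical one-dimensional feature values.
a, b, c = 0.099, 0.101, 0.199

# a and b differ by only 0.002 but straddle the 0.1 boundary,
# while b and c differ by 0.098 and still share a segment.
print(segment_index(a), segment_index(b), segment_index(c))
```

Here `a` and `b` are near neighbours that are assigned to different segments, so recalling only one segment misses one of them, which is the missed-recall case the text describes.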
  • The present application provides a method, apparatus, device, and medium for training a retrieval model and for retrieval, which can improve retrieval accuracy. The technical solution is as follows:
  • A method for training a retrieval model is provided. The method is executed by a computer device. The retrieval model includes an embedding vector network and a quantization index network; the embedding vector network is used to obtain the feature vector of a retrieval object, and the quantization index network is used to extract the quantization index of the retrieval object. The method includes:
  • obtaining n sample triplets for training the retrieval model; each sample triplet includes a training sample, a positive sample that forms a similar sample pair with the training sample, and a negative sample that does not form a similar sample pair with the training sample; n is a positive integer greater than 1;
  • inputting the basic feature vectors of the n sample triplets into the embedding vector network; according to the error of the feature vectors output by the embedding vector network, selecting a first sample triplet set for training the quantization index network;
  • training the quantization index network based on the first sample triplet set, and training the embedding vector network based on the second sample triplet set.
  • An object retrieval method is provided. The method is executed by a computer device and includes:
  • indexing the feature vectors of m candidate objects from the quantization codebook based on the quantization index; the quantization codebook stores the mapping relationship between the quantization index and the feature vectors of the m candidate objects; m is a positive integer;
  • selecting, from the m fifth distances sorted in ascending order, the candidate objects corresponding to the fifth distances ranked in the top z%.
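The retrieval steps above (look up candidates by quantization index, compute distances to the query feature vector, sort in ascending order, keep the top z%) can be sketched as follows. This is a minimal illustration, not the patented implementation: the codebook is assumed to be a plain dictionary from a quantization index tuple to a list of `(candidate_id, feature_vector)` entries, Euclidean distance is assumed, and all names are hypothetical.

```python
import numpy as np

def retrieve(query_index, query_vec, codebook, z_percent=10.0):
    """Return (candidate_id, distance) pairs for the closest z% of candidates."""
    # Look up the candidates stored under the query's quantization index.
    candidates = codebook.get(tuple(query_index), [])
    if not candidates:
        return []
    ids = [cid for cid, _ in candidates]
    vecs = np.stack([np.asarray(v, dtype=float) for _, v in candidates])
    query_vec = np.asarray(query_vec, dtype=float)
    # One distance per candidate (the "fifth distance" in the text).
    dists = np.linalg.norm(vecs - query_vec, axis=1)
    # Sort ascending and keep the top z%, but at least one candidate.
    order = np.argsort(dists)
    keep = max(1, int(np.ceil(len(ids) * z_percent / 100.0)))
    return [(ids[i], float(dists[i])) for i in order[:keep]]

# Hypothetical codebook with two quantization indexes.
codebook = {
    (1, 0, 1): [("a", [0.0, 0.0]), ("b", [3.0, 4.0]),
                ("c", [1.0, 0.0]), ("d", [6.0, 8.0])],
    (1, 1, 1): [("e", [0.0, 0.0])],
}
print(retrieve((1, 0, 1), [0.0, 0.0], codebook, z_percent=50))
```

Rounding the z% cutoff up and returning at least one candidate is a design choice of this sketch; the source does not state how fractional counts are handled.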
  • A retrieval model training apparatus is provided. The retrieval model includes an embedding vector network and a quantization index network; the embedding vector network is used to obtain the feature vector of a retrieval object, and the quantization index network is used to extract the quantization index of the retrieval object. The apparatus includes:
  • an obtaining module, used to obtain n sample triplets for training the retrieval model; each sample triplet includes a training sample, a positive sample that forms a similar sample pair with the training sample, and a negative sample that does not form a similar sample pair with the training sample;
  • a screening module, used to input the basic feature vectors of the n sample triplets into the embedding vector network and, according to the feature vectors output by the embedding vector network, select a first sample triplet set for training the quantization index network;
  • the screening module is also used to input the basic feature vectors of the n sample triplets into the quantization index network and, according to the quantization indexes output by the quantization index network, select a second sample triplet set for training the embedding vector network;
  • a training module, used to train the quantization index network based on the first sample triplet set and to train the embedding vector network based on the second sample triplet set.
  • An object retrieval apparatus is provided. The apparatus includes:
  • an obtaining module, configured to obtain the basic feature vector of the query object;
  • an input module, used to input the basic feature vector into the quantization index network and the embedding vector network;
  • the obtaining module is also used to obtain the quantization index of the query object through the quantization index network, and to obtain the feature vector of the query object through the embedding vector network;
  • an index module, configured to index the feature vectors of m candidate objects from the quantization codebook based on the quantization index; the quantization codebook stores the mapping relationship between the quantization index and the feature vectors of the m candidate objects;
  • a calculation module, configured to calculate the fifth distance between the feature vector of each of the m candidate objects and the feature vector of the query object, to obtain m fifth distances;
  • a screening module, configured to select, from the m fifth distances sorted in ascending order, the candidate objects corresponding to the fifth distances ranked in the top z%.
  • A computer device is provided. The computer device includes a processor and a memory; the memory stores a computer program, and the computer program is loaded and executed by the processor to implement the above retrieval model training method and/or object retrieval method.
  • A computer-readable storage medium is provided. The storage medium stores a computer program, and the computer program is loaded and executed by a processor to implement the above retrieval model training method and/or object retrieval method.
  • A computer program product or computer program is provided, comprising computer instructions stored in a computer-readable storage medium. The processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, so that the computer device performs the above retrieval model training method and/or object retrieval method.
  • Through screening, the embedding vector network and the quantization index network can remove noise from the sample triplets. Training the quantization index network with the first sample triplet set lets it learn the noise recognition ability of the embedding vector network, and training the embedding vector network with the second sample triplet set lets it learn the noise recognition ability of the quantization index network, so that the noise recognition abilities of the two networks improve together.
  • The above training method eliminates the influence of noise sample triplets among the sample triplets and makes the quantization index network and the embedding vector network predict positive and negative samples similarly. Noise sample triplets are predicted through the two branches, and the two branches learn high-quality sample triplets from each other to achieve denoising learning, so that the two branches have similar prediction effects.
  • FIG. 1 is a schematic diagram of PQ quantization provided by an exemplary embodiment of the present application.
  • FIG. 2 is a schematic diagram of the implementation environment of the retrieval model according to an exemplary embodiment of the present application.
  • FIG. 3 is a flowchart of a method for training a retrieval model according to an exemplary embodiment of the present application.
  • FIG. 4 is a schematic diagram of a training system of a retrieval model according to an exemplary embodiment of the present application.
  • Fig. 5 is a flowchart of a method for training a retrieval model according to another exemplary embodiment of the present application.
  • FIG. 6 is a flowchart of a method for training a retrieval model according to another exemplary embodiment of the present application.
  • FIG. 7 is a flowchart of a method for training a retrieval model according to another exemplary embodiment of the present application.
  • Fig. 8 is a schematic diagram of a training system of a retrieval model according to another exemplary embodiment of the present application.
  • FIG. 9 is a flowchart of a method for training a retrieval model according to another exemplary embodiment of the present application.
  • Fig. 10 is a flowchart of an image retrieval method provided by an exemplary embodiment of the present application.
  • Fig. 11 is a schematic diagram of a system using a retrieval model provided by an exemplary embodiment of the present application.
  • Fig. 12 is a flowchart of an audio retrieval method provided by an exemplary embodiment of the present application.
  • Fig. 13 is a flowchart of a video retrieval method provided by an exemplary embodiment of the present application.
  • Fig. 14 is a structural block diagram of a training device for a retrieval model provided by an exemplary embodiment of the present application.
  • Fig. 15 is a structural block diagram of an object retrieval device provided by an exemplary embodiment of the present application.
  • Fig. 16 is a structural block diagram of a computer device provided by an exemplary embodiment of the present application.
  • Fig. 17 is a schematic diagram of a data sharing system provided by an exemplary embodiment of the present application.
  • Fig. 18 is a schematic diagram of a block chain structure provided by an exemplary embodiment of the present application.
  • Fig. 19 is a schematic diagram of a new block generation process provided by an exemplary embodiment of the present application.
  • In the training phase, referring to FIG. 1, assume that each of the N training samples has 128 dimensions and is divided into 4 subspaces, so that each subspace has 32 dimensions. In each subspace, k-means (a clustering algorithm) is used to cluster the feature vectors (256 cluster centers are shown in FIG. 1), which yields one codebook per subspace. Each sub-segment of each training sample can be approximated by a cluster center of its subspace, and the corresponding code is the ID of that cluster center. The codes of the N training samples together constitute the index codebook.
  • For a sample to be quantized, the same segmentation is performed, the nearest cluster center is found in each subspace one by one, and each sub-segment is then represented by the ID of its cluster center, which gives the quantization index vector of the sample.
  • In the retrieval phase, the feature vector of the query image is divided into 4 sub-segments. In each subspace, the distances from the sub-segment to all cluster centers in that subspace are calculated, giving 4*256 distances, and these distances are stored as a distance table.
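The PQ training, encoding, and distance-table steps above can be sketched in NumPy. This is a minimal sketch under stated assumptions: a tiny hand-written k-means stands in for a library clustering routine, squared Euclidean distance is used, and all function names are hypothetical. The sizes are parameters, so the 128-dimension, 4-subspace, 256-center setting from the text is one possible configuration.

```python
import numpy as np

def kmeans(x, k, iters=10, seed=0):
    """Very small Lloyd's k-means; returns k cluster centers."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared distance from every point to every center.
        d = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers

def train_pq(samples, num_sub, k):
    """Cluster each subspace separately; one codebook per subspace."""
    return [kmeans(sub, k) for sub in np.split(samples, num_sub, axis=1)]

def encode(vec, codebooks):
    """Quantization index vector: nearest center ID in each subspace."""
    subs = np.split(vec, len(codebooks))
    return np.array([((cb - s) ** 2).sum(axis=1).argmin()
                     for cb, s in zip(codebooks, subs)])

def distance_table(query, codebooks):
    """Table of shape (num_sub, k): query sub-segment to every center."""
    subs = np.split(query, len(codebooks))
    return np.stack([((cb - s) ** 2).sum(axis=1)
                     for cb, s in zip(codebooks, subs)])

def approx_distance(table, code):
    """Approximate query-to-sample distance by table lookup."""
    return table[np.arange(len(code)), code].sum()

samples = np.random.default_rng(0).normal(size=(200, 16))
codebooks = train_pq(samples, num_sub=4, k=8)
code = encode(samples[0], codebooks)
table = distance_table(samples[1], codebooks)
print(code, approx_distance(table, code))
```

With the distance table in hand, the approximate distance to any encoded sample costs one lookup per subspace instead of a full vector comparison.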
  • Deep sign quantization network (quantization index network): for the D-dimensional feature vectors of N training samples, after vector normalization each dimension is a floating-point number in the range -1 to 1. Sign quantization compresses the D-dimensional feature vector into a binary code with a specified number of bits, each of which is 0 or 1. For example, normalizing a 4-dimensional feature vector gives (-1, 1, 0.5, -0.2), and sign quantization then gives the quantization index (0, 1, 1, 0).
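The sign quantization example above can be reproduced with a short sketch. It assumes that positive values map to 1 and all other values map to 0 (the text does not say how an exact zero is handled), and the helper names are hypothetical.

```python
import numpy as np

def sign_quantize(vec):
    """Map each normalized dimension to one bit: 1 if positive, else 0."""
    return (np.asarray(vec, dtype=float) > 0).astype(np.int8)

def hamming_distance(code_a, code_b):
    """Number of bit positions in which two quantization indexes differ."""
    return int(np.count_nonzero(np.asarray(code_a) != np.asarray(code_b)))

index = sign_quantize([-1, 1, 0.5, -0.2])
print(index.tolist())  # [0, 1, 1, 0], matching the example in the text
```

The Hamming distance helper is included only to show one common way of comparing binary codes; the source does not specify it.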
  • Computer Vision (CV): computer vision is the science of how to make machines "see". More specifically, it uses cameras and computers instead of human eyes to identify, track, and measure targets, and further performs graphics processing so that the result is an image better suited to human observation or to transmission to an instrument for detection. As a scientific discipline, computer vision studies related theories and technologies and tries to build artificial intelligence systems that can obtain information from images or multidimensional data.
  • Computer vision technology usually includes image processing, image recognition, image semantic understanding, image retrieval, OCR (Optical Character Recognition), video processing, video semantic understanding, video content/behavior recognition, 3D object reconstruction, 3D technology, virtual reality, augmented reality, and simultaneous localization and mapping, as well as common biometric identification technologies such as face recognition and fingerprint recognition.
  • Artificial intelligence cloud service: an artificial intelligence cloud service is generally also called AIaaS (AI as a Service).
  • This service model is similar to opening an AI-themed mall: all developers can access one or more artificial intelligence services provided by the platform through an API (Application Programming Interface).
  • Some senior developers can also use the AI framework and AI infrastructure provided by the platform to deploy and maintain their own cloud AI services.
  • The information (including but not limited to user equipment information, user personal information, etc.), data (including but not limited to data used for analysis, stored data, displayed data, etc.), and signals involved in this application are all authorized by the user or fully authorized by all parties, and the collection, use, and processing of the relevant data must comply with the relevant laws, regulations, and standards of the relevant countries and regions.
  • The retrieval objects involved in this application are all obtained under full authorization.
  • FIG. 2 shows a computer system 200 in which the retrieval model training method and usage method provided by an exemplary embodiment of the present application are implemented.
  • The terminal 220 is used for training the retrieval model and/or using the retrieval model.
  • The server 240 is used for training the retrieval model and/or using the retrieval model.
  • A client that supports object retrieval is installed and runs on the terminal 220.
  • The client can be any one of an application program, a web page, and an applet that supports object retrieval.
  • The retrieval object includes, but is not limited to, at least one of image, video, audio, and text.
  • The terminal 220 may train the retrieval model and/or use the retrieval model.
  • The client installed on the terminal 220 is a client on an operating system platform (Android or iOS).
  • The terminal 220 may refer generally to one of multiple terminals; this embodiment uses only the terminal 220 as an example.
  • The device type of the terminal 220 includes at least one of a smart phone, a smart watch, a smart TV, a tablet computer, an e-book reader, an MP3 player, an MP4 player, a laptop computer, and a desktop computer.
  • the terminal 220 is connected to the server 240 through a wireless network or a wired network.
  • The server 240 includes at least one of a single server, multiple servers, a cloud computing platform, and a virtualization center.
  • The server 240 includes a processor 244 and a memory 242. The memory 242 further includes a receiving module 2421, a control module 2422, and a sending module 2423. The receiving module 2421 is used to receive requests sent by the client, such as requests to train or use the retrieval model; the control module 2422 is used to control the training and use of the retrieval model; the sending module 2423 is used to send a response to the terminal, such as returning the retrieved object or the trained retrieval model to the client.
  • The server 240 is used to provide background services for object retrieval and/or retrieval model training.
  • The server 240 undertakes the main calculation work and the terminal 220 undertakes the secondary calculation work; or the server 240 undertakes the secondary calculation work and the terminal 220 undertakes the main calculation work; or the server 240 and the terminal 220 perform collaborative computing using a distributed computing architecture.
  • The computer device can be a terminal or a server.
  • This application does not limit the execution subject of the retrieval model training method or the retrieval method.
  • FIG. 3 is a flowchart of a retrieval model training method according to an exemplary embodiment of the present application.
  • The method is described using execution by the computer device shown in FIG. 2 as an example. The retrieval model includes an embedding vector network and a quantization index network; the embedding vector network is used to obtain the feature vector of the retrieval object, and the quantization index network is used to extract the quantization index of the retrieval object. The method includes:
  • Step 320: obtain n sample triplets for training the retrieval model.
  • Each sample triplet includes a training sample, a positive sample that forms a similar sample pair with the training sample, and a negative sample that does not form a similar sample pair with the training sample; n is a positive integer greater than 1.
  • Image retrieval: the quantization index is obtained from the feature vector of the query image; m feature vectors matching the quantization index are then found in the quantization codebook; the distances between the m feature vectors and the feature vector of the query image are calculated; and the images corresponding to the top-ranked distances are selected as the final screened images, where m is a positive integer.
  • Audio retrieval: the quantization index is obtained from the feature vector of the query audio; a feature vectors matching the quantization index are then found in the quantization codebook; the distances between the a feature vectors and the feature vector of the query audio are calculated; and the audio corresponding to the top-ranked distances is selected as the final screened audio, where a is a positive integer.
  • Video retrieval: the quantization index is obtained from the feature vector of the query video; b feature vectors matching the quantization index are then found in the quantization codebook; the distances between the b feature vectors and the feature vector of the query video are calculated; and the videos corresponding to the top-ranked distances are selected as the final screened videos, where b is a positive integer.
  • Quantization index: the index used to retrieve the retrieval object.
  • A quantization index corresponds to a type of image. For example, if the image categories of all sample images are divided into a person category and an animal category, then one quantization index corresponds to the person category and another quantization index corresponds to the animal category.
  • For example, the sample images include a dog image and a cat image, where the quantization index of the dog image is (1, 0, 1) and the quantization index of the cat image is (1, 1, 1). If the quantization index of the retrieved image is (1, 0, 1), then quantization retrieval determines that the retrieved image is a dog image.
  • Quantization index network: the neural network used to extract the quantization index of the retrieval object.
  • Embedding vector network: the neural network used to obtain the feature vector of the retrieval object.
  • The embedding vector network performs secondary encoding on the input basic feature vector to obtain triplet feature vectors.
  • For example, the size of the input basic feature vector is 1×2048; the embedding vector network performs secondary encoding on the input basic feature vector to reduce its size, giving a triplet feature vector of size 1×64.
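The size reduction from 1×2048 to 1×64 can be illustrated with a minimal sketch. The source does not specify the architecture of the secondary encoding, so this example assumes a single randomly initialized linear layer followed by L2 normalization; both choices, and the name `make_embedding_layer`, are assumptions made only for illustration.

```python
import numpy as np

def make_embedding_layer(in_dim=2048, out_dim=64, seed=0):
    """Return a function that projects basic feature vectors to out_dim."""
    rng = np.random.default_rng(seed)
    # Assumed architecture: one linear layer with untrained random weights.
    weight = rng.normal(scale=in_dim ** -0.5, size=(in_dim, out_dim))
    bias = np.zeros(out_dim)

    def embed(basic):
        out = np.atleast_2d(basic) @ weight + bias
        # Assumed post-processing: scale every output vector to unit length.
        return out / np.linalg.norm(out, axis=1, keepdims=True)

    return embed

embed = make_embedding_layer()
# One triplet: anchor, positive, and negative basic feature vectors.
triplet_basic = np.random.default_rng(1).normal(size=(3, 2048))
triplet_features = embed(triplet_basic)
print(triplet_features.shape)  # (3, 64)
```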
  • The retrieval object includes, but is not limited to, at least one of image, video, and audio.
  • The following description takes an image as an example of the retrieval object.
  • The retrieval model is often trained with sample triplets.
  • Step 340: input the basic feature vectors of the n sample triplets into the embedding vector network; according to the error of the feature vectors output by the embedding vector network, select the first sample triplet set for training the quantization index network.
  • Basic feature vector: the vector of basic features of an image in a sample triplet.
  • The basic features of an image include, but are not limited to, color features, texture features, shape features, and spatial relationship features.
  • The embedding vector network generates n groups of triplet feature vectors from the basic feature vectors of the n sample triplets, and the first sample triplet set for training the quantization index network is selected based on the n groups of triplet feature vectors.
  • Step 360: input the basic feature vectors of the n sample triplets into the quantization index network; according to the error of the quantization indexes output by the quantization index network, select the second sample triplet set for training the embedding vector network.
  • The quantization index network generates n groups of triplet quantization indexes from the basic feature vectors of the n sample triplets, and the second sample triplet set for training the embedding vector network is selected.
  • Step 381: train the quantization index network based on the first sample triplet set.
  • Step 382: train the embedding vector network based on the second sample triplet set.
  • The quantization index network is trained based on the first sample triplet set, and the embedding vector network is trained based on the second sample triplet set.
  • The first sample triplet set is input into the quantization index network to obtain n1 third error losses; the quantization index network is trained based on the n1 third error losses, where n1 is a positive integer smaller than n.
  • n2 is a positive integer smaller than n.
  • In summary, when the embedding vector network and the quantization index network are trained with sample triplets, the first sample triplet set screened by the embedding vector network is used as the sample triplets for training the quantization index network, and the second sample triplet set screened by the quantization index network is used as the sample triplets for training the embedding vector network.
  • The above retrieval model training method eliminates the influence of noise sample triplets among the sample triplets and makes the quantization index network and the embedding vector network predict positive and negative samples similarly. Noise sample triplets are predicted through the two branches, and the two branches learn high-quality sample triplets from each other to achieve denoising learning, so that the two branches have similar prediction effects.
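The cross-screening in steps 340 to 382 can be sketched as follows. The text says only that each set is selected according to the error of the corresponding network's output, so this sketch makes explicit assumptions: the error is a margin-based triplet loss, and each branch keeps the fraction of triplets with the lowest loss. The names `triplet_loss`, `select_low_loss`, `cross_screen`, and the `keep_ratio` parameter are hypothetical.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Per-triplet loss: anchor should be closer to positive than negative."""
    d_pos = np.linalg.norm(anchor - positive, axis=1)
    d_neg = np.linalg.norm(anchor - negative, axis=1)
    return np.maximum(0.0, d_pos - d_neg + margin)

def select_low_loss(losses, keep_ratio=0.8):
    """Indices of the triplets with the smallest loss (assumed criterion)."""
    keep = max(1, int(len(losses) * keep_ratio))
    return np.sort(np.argsort(losses)[:keep])

def cross_screen(embed_losses, index_losses, keep_ratio=0.8):
    """Each branch screens the triplets used to train the other branch."""
    # Screened by the embedding vector network, trains the index network.
    first_set = select_low_loss(embed_losses, keep_ratio)
    # Screened by the quantization index network, trains the embedding network.
    second_set = select_low_loss(index_losses, keep_ratio)
    return first_set, second_set

# Hypothetical per-triplet losses from the two branches for n = 5 triplets.
embed_losses = np.array([0.1, 2.0, 0.3, 0.2, 0.05])
index_losses = np.array([0.2, 0.1, 1.5, 0.3, 0.4])
first_set, second_set = cross_screen(embed_losses, index_losses)
print(first_set, second_set)
```

In this toy run the embedding branch drops the triplet it finds hardest and the index branch drops a different one, so each network is trained on fewer than n triplets, consistent with n1 and n2 being smaller than n.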
  • FIG. 4 shows a retrieval model training system provided by an exemplary embodiment of the present application. The training system includes an embedding vector network 401 and a quantization index network 402.
  • The embedding vector network 401 is used to generate n groups of triplet feature vectors from the basic feature vectors of the n sample triplets; the triplet feature vectors can be used to calculate the similarity distances between the training samples, positive samples, and negative samples in the sample triplets.
  • The quantization index network 402 is used to generate n groups of triplet quantization indexes from the basic feature vectors of the n sample triplets; the triplet quantization indexes are used to quantize the sample triplets.
  • The training system shown in FIG. 4 constructs sample triplets from samples, inputs the n constructed sample triplets into the embedding vector network 401 and the quantization index network 402 respectively, and then screens the sample triplets. The loss function of the quantization index network is calculated on the first sample triplet set screened by the embedding vector network 401, and the quantization index network 402 is trained based on the loss function value; the loss function of the embedding vector network is calculated on the second sample triplet set screened by the quantization index network 402, and the embedding vector network 401 is trained based on the loss function value.
  • The screening of the n sample triplets through two branches is based on the principle that noise sample triplets (sample triplets in which the negative sample and the training sample actually constitute a positive sample pair) do not necessarily behave consistently in the embedding vector network 401 and the quantization index network 402.
  • For example, the embedding vector network 401 may treat the negative sample and the training sample in a noise sample triplet as a positive sample pair, which produces a high loss function value, while the quantization index network 402 treats the same negative sample and training sample as a negative sample pair, which produces a low loss function value.
  • In this way, the embedding vector network 401 and the quantization index network 402 can be trained better.
  • For step 320, n sample triplets for training the retrieval model are obtained.
  • Similar sample pairs: in the retrieval model of this application, similar sample pairs need to be introduced to train the embedding vector network and the quantization index network.
  • for images, a similar sample pair includes a target image and a positive sample that is extremely similar or identical to the target image. For example, two adjacent frames of a video can form a similar sample pair.
  • for videos, a similar sample pair includes a target video and a positive sample that is extremely similar or identical to the target video.
  • for audio, a similar sample pair includes a target audio and a positive sample that is extremely similar or identical to the target audio. For example, two adjacent seconds of audio can form a similar sample pair.
  • Sample triplets: to generate the n sample triplets, within each batch of R similar sample pairs and for the training sample of each similar sample pair, one image is randomly selected from each of the remaining (R-1) similar sample pairs; the distances between the feature vectors of these (R-1) images and the feature vector of the training sample are calculated and sorted from small to large; the images corresponding to the first n distances are each combined, as negative samples, with the similar sample pair containing the training sample to form n sample triplets. In this way, n sample triplets for training the retrieval model are obtained.
  • for example, the similar sample pairs include (A, A'), (B, B'), (C, C'), (D, D'), ..., (Z, Z'). For the similar sample pair (A, A'), A is the training sample (anchor) and A' is the positive sample (positive). One image is randomly selected from each of the remaining similar sample pairs (B, B'), (C, C'), (D, D'), ..., (Z, Z'), for example B, C', D, E', ..., Z. The distances between the feature vectors of B, C', D, E', ..., Z and the feature vector of the training sample A are calculated, giving 25 distances; the 25 distances are sorted from small to large, and the images corresponding to the first 20 distances are each combined with the similar sample pair (A, A') containing the training sample A to form 20 sample triplets. In this way, 20 sample triplets for training the retrieval model are obtained.
  • the above method realizes the construction of n sample triplets from similar sample pairs, so that the obtained n sample triplets can be used for the training of the retrieval model provided by this application.
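  • as an illustrative sketch of the triplet construction described above (not the implementation of this application), the following NumPy code picks one random sample from each remaining similar sample pair, sorts the candidates by distance to the training sample, and keeps the n nearest as negative samples. The function name `mine_triplets`, the array layout `pairs[r, side]`, the use of the Euclidean distance, and the random feature vectors are all assumptions made for illustration.

```python
import numpy as np

def mine_triplets(pairs, anchor_idx, n, rng):
    """Build n triplets for one training sample (hypothetical helper).

    pairs: array of shape (R, 2, d); pairs[r, 0] and pairs[r, 1] are the
    feature vectors of the two samples of similar sample pair r.
    Returns index tuples ((r, 0), (r, 1), (j, side)) = (anchor, positive, negative).
    """
    num_pairs = pairs.shape[0]
    anchor = pairs[anchor_idx, 0]
    others = [j for j in range(num_pairs) if j != anchor_idx]
    # Randomly pick one of the two samples from each remaining pair.
    sides = rng.integers(0, 2, size=len(others))
    candidates = np.stack([pairs[j, s] for j, s in zip(others, sides)])
    # Distance between each candidate and the training sample (assumed Euclidean).
    dists = np.linalg.norm(candidates - anchor, axis=1)
    nearest = np.argsort(dists)[:n]  # smallest distances first
    return [((anchor_idx, 0), (anchor_idx, 1), (others[k], int(sides[k])))
            for k in nearest]

rng = np.random.default_rng(0)
pairs = rng.normal(size=(26, 2, 8))  # 26 pairs (A, A') ... (Z, Z'), made-up features
triplets = mine_triplets(pairs, anchor_idx=0, n=20, rng=rng)
```

  • with 26 similar sample pairs, 25 candidate distances are computed for training sample A and the 20 nearest candidates yield 20 sample triplets, matching the example above.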
  • Fig. 5 shows the training method of the retrieval model provided by an exemplary embodiment of the present application. The method is described as applied to the computer device shown in Fig. 2, where step 320, step 360, step 381 and step 382 are all consistent with those shown in Fig. 3.
  • For step 340, the basic feature vectors of the n sample triplets are input into the embedding vector network; based on the errors of the feature vectors output by the embedding vector network, the first sample triplet set used for training the quantization index network is screened out.
  • Step 340 may include the following steps:
  • Step 341 obtaining n sets of triplet feature vectors output by the embedding vector network for n sample triplets
  • the embedding vector network 401 By inputting n sample triplets to the embedding vector network 401, the embedding vector network 401 outputs n sets of triplet feature vectors.
  • Step 342 calculating n first error losses corresponding to n sets of triplet feature vectors
  • calculating n first error losses corresponding to n groups of triplet feature vectors includes:
  • For each group of triplet feature vectors, calculate the first distance between the feature vector of the training sample and the feature vector of the positive sample, and the second distance between the feature vector of the training sample and the feature vector of the negative sample; then calculate the first error loss from the difference between the first distance and the second distance and the first distance threshold. The first distance threshold is the threshold for the difference between the training-sample-to-positive-sample distance and the training-sample-to-negative-sample distance.
  • the first error loss is calculated as: $l_{tri} = \max\left(\lVert X_a - X_p \rVert - \lVert X_a - X_n \rVert + \alpha,\ 0\right)$ (1)
  • a denotes the anchor (training sample), p denotes the positive sample, and n denotes the negative sample
  • $l_{tri}$ represents the first error loss
  • $X_a$ represents the feature vector of the training sample
  • $X_p$ represents the feature vector of the positive sample
  • $X_n$ represents the feature vector of the negative sample
  • $\alpha$ represents the first distance threshold
  • $\lVert X_a - X_p \rVert$ represents the first distance
  • $\lVert X_a - X_n \rVert$ represents the second distance
  • the value of $\alpha$ is 4, meaning that the distance between the feature vector of the training sample and the feature vector of the negative sample must exceed the distance between the feature vector of the training sample and the feature vector of the positive sample by more than 4. L2 normalization is used so that the feature space of the triplet samples lies in the range of 0 to 1, which avoids an overly large feature space that would hinder optimal training of the retrieval model.
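  • as a minimal sketch of the first error loss described above, the following NumPy code L2-normalizes the three feature vectors, computes the first and second distances, and applies the margin (the first distance threshold). The function names, the use of the Euclidean norm, and the toy vectors are assumptions for illustration; the margin is left as a parameter rather than fixed.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each vector to unit L2 norm (eps avoids division by zero)."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)

def first_error_loss(x_a, x_p, x_n, margin):
    """Triplet loss: max(d(a, p) - d(a, n) + margin, 0)."""
    x_a, x_p, x_n = (l2_normalize(v) for v in (x_a, x_p, x_n))
    first_distance = np.linalg.norm(x_a - x_p, axis=-1)   # anchor to positive
    second_distance = np.linalg.norm(x_a - x_n, axis=-1)  # anchor to negative
    return np.maximum(first_distance - second_distance + margin, 0.0)

# Toy vectors (made up): the positive points the same way as the anchor,
# the negative is orthogonal to it.
x_a = np.array([1.0, 0.0])
x_p = np.array([2.0, 0.0])
x_n = np.array([0.0, 3.0])
loss_small_margin = first_error_loss(x_a, x_p, x_n, margin=0.5)
loss_large_margin = first_error_loss(x_a, x_p, x_n, margin=4.0)
```

  • a triplet whose negative sample is already farther away than the positive sample by at least the margin contributes zero loss; otherwise the loss grows with the size of the violation.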
  • Step 343 from the sorting result of the n first error losses from small to large, screen out the sample triplets corresponding to the n 1 first error losses that fall within the first selection range, and add them to the first sample triplet set used for training the quantization index network.
  • the n first error losses are sorted from small to large, and the sample triplets corresponding to the n 1 first error losses that fall within the first selection range are added to the first sample triplet set used for training the quantization index network, where n 1 is a positive integer smaller than n.
  • the value of x that defines the first selection range is set based on the proportion of noise sample triplets among the n sample triplets; optionally, this proportion is obtained by prediction or calibration.
  • the value of x is slightly larger than the proportion of the noise sample triplets in the n sample triplets. x is a positive number.
  • for example, the n first error losses constitute a list Lem_list, and the n 1 sample triplets corresponding to the n 1 first error losses ranked in the top 85% of Lem_list are selected as the first sample triplet set.
  • the above method designs a loss function for the embedding vector network, calculates the loss function values of the n groups of triplet feature vectors of the n sample triplets, and screens, based on these loss function values, the first sample triplet set used for training the quantization index network, so that cleaner sample triplets are used to optimize the quantization effect of the quantization index network.
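  • the screening in step 343 can be sketched as follows: sort the losses from small to large and keep the indices that fall in the lowest given percentage. The helper name `select_low_loss`, the rounding down of the kept count, and the random loss values are assumptions; the 85% figure comes from the Lem_list example above.

```python
import numpy as np

def select_low_loss(losses, keep_percent):
    """Return indices of the samples whose loss is in the lowest keep_percent."""
    losses = np.asarray(losses, dtype=float)
    keep = int(len(losses) * keep_percent / 100)  # rounding down is an assumption
    return np.argsort(losses, kind="stable")[:keep]

rng = np.random.default_rng(1)
lem_list = rng.uniform(0.0, 5.0, size=20)      # made-up first error losses
first_set = select_low_loss(lem_list, 85)      # indices of the kept sample triplets
```

  • with n = 20 first error losses and a selection range of 85%, n 1 = 17 sample triplets are kept and the 3 with the largest losses are treated as likely noise.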
  • Fig. 6 shows the training method of the retrieval model provided by an exemplary embodiment of the present application. The method is described as applied to the computer device shown in Fig. 2, where step 320, step 340, step 381 and step 382 are consistent with those shown in Fig. 3.
  • For step 360, the basic feature vectors of the n sample triplets are input into the quantization index network; based on the errors of the quantization indexes output by the quantization index network, the second sample triplet set used for training the embedding vector network is screened out;
  • Step 360 may include the following steps:
  • Step 361 obtaining n groups of triplet quantization indexes output by the quantization index network for n sample triplets
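  • the basic feature vectors of the n sample triplets are input into the quantization index network 402, which outputs n groups of triplet quantization indexes.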
  • Step 362 calculating n second error losses corresponding to n sets of triplet quantization indices
  • the method for calculating n second error losses corresponding to n groups of triplet quantization indexes includes:
  • for each group of triplet quantization indexes, calculate the first triplet loss of the triplet quantization indexes; for each group of triplet quantization indexes, calculate the first quantization error loss of the triplet quantization indexes; perform a weighted summation of the first triplet loss and the first quantization error loss to obtain the second error loss.
  • Calculating the first triplet loss: schematically, the triplet quantization index is activated by an activation function, and then the first triplet loss of the quantization index is calculated.
  • the distance between the training sample and the negative sample of the sample triplet needs to be large enough.
  • optionally, the margin ($\alpha$ in formula (1)) is set to 160.
  • the calculation method of the first triplet loss is similar to the calculation method of the above-mentioned first error loss, and will not be repeated here.
  • the difference here is that the dimension of the triplet quantization index is not consistent with the dimension of the triplet feature vector.
  • the triplet quantization indexes have 256 dimensions; each dimension is converted into a value in (-1, 1) by an activation function and then sign quantization is performed, finally giving a 256-dimensional triplet quantization index.
  • u i is the value of each dimension of the triplet quantization index before sign quantization (in sign quantization, a value less than 0 is quantized to 0 and a value greater than 0 is quantized to 1), and the range of u i is (-1, 1); b i is the value (-1 or 1) of each dimension of the triplet quantization index after conversion;
  • L coding is the first quantization error loss
  • the above activation function can be set to the Tanh function or the sgn function. It is worth noting that the sgn function is not differentiable at 0 (+0 and -0), so its gradient cannot be computed and it cannot be used in deep learning based on SGD gradient back-propagation, whereas the Tanh activation is differentiable and maps values to between -1 and 1.
  • the activation function can also be sigmoid (which activates to between 0 and 1), in which case 0 and 1 are used as the quantization targets (instead of -1 and 1). Among these, tanh reaches (-1, 1) faster and gives a better training effect.
  • the second error loss is the weighted sum $L_q = w_{21} L_{triplet} + w_{22} L_{coding}$, where $L_q$ is the second error loss, $L_{triplet}$ is the first triplet loss, $L_{coding}$ is the first quantization error loss, and $w_{21}$ and $w_{22}$ are weights.
  • optionally, because the first quantization error loss converges faster than the first triplet loss, and in order to keep the first triplet loss dominant in the overall second error loss so that the embedding vector network always retains the ability to measure similarity, $w_{21}$ is set to 1 and $w_{22}$ is set to 0.5.
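  • the following NumPy sketch shows one way the second error loss could be assembled from the pieces described above: Tanh activation, a triplet loss on the activated outputs, a quantization error term, and the weighted sum with weights 1 and 0.5. The exact form of the quantization error loss is not recoverable from the text, so the mean squared difference between each activated value u i and its sign b i used here is an assumption, as are the function name, the Euclidean distance, and the toy inputs.

```python
import numpy as np

def second_error_loss(raw_a, raw_p, raw_n, margin, w21=1.0, w22=0.5):
    """Weighted sum of a triplet loss and a quantization error (sketch).

    raw_a, raw_p, raw_n: pre-activation outputs of shape (n, dims) for the
    training, positive and negative samples of n sample triplets.
    """
    u_a, u_p, u_n = np.tanh(raw_a), np.tanh(raw_p), np.tanh(raw_n)  # values in (-1, 1)
    d_ap = np.linalg.norm(u_a - u_p, axis=-1)
    d_an = np.linalg.norm(u_a - u_n, axis=-1)
    l_triplet = np.maximum(d_ap - d_an + margin, 0.0)
    u = np.stack([u_a, u_p, u_n])                  # shape (3, n, dims)
    b = np.where(u >= 0, 1.0, -1.0)                # sign-quantized target, -1 or 1
    # Assumed quantization error: mean squared gap between u and b per triplet.
    l_coding = np.mean((u - b) ** 2, axis=(0, 2))
    return w21 * l_triplet + w22 * l_coding

# Toy 4-dimensional outputs (made up); the text describes 256 dimensions.
raw_a = np.full((1, 4), 10.0)
raw_p = np.full((1, 4), 10.0)
raw_n = np.full((1, 4), -10.0)
total_small = second_error_loss(raw_a, raw_p, raw_n, margin=1.0)
total_large = second_error_loss(raw_a, raw_p, raw_n, margin=5.0)
```

  • because the activated values here are already very close to -1 or 1, the quantization term is nearly zero and the triplet term dominates, which is the behaviour the weights 1 and 0.5 are meant to encourage.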
  • Step 363 from the sorting result of the n second error losses from small to large, screen out the sample triplets corresponding to the n 2 second error losses that fall within the second selection range, and add them to the second sample triplet set used for training the embedding vector network.
  • the n second error losses are sorted from small to large, and the sample triplets corresponding to the n 2 second error losses that fall within the second selection range are added to the second sample triplet set used for training the embedding vector network.
  • the value of y that defines the second selection range is set based on the proportion of noise sample triplets among the n sample triplets; optionally, this proportion is obtained by prediction or calibration.
  • the value of y is slightly larger than the proportion of the noise sample triplets in the n sample triplets. y is a positive number.
  • n 2 is a positive integer less than n.
  • the above method designs a loss function for the quantization index network, calculates the loss function values of the n groups of triplet quantization indexes of the n sample triplets, and screens, based on these loss function values, the second sample triplet set used for training the embedding vector network, so that cleaner sample triplets are used to optimize the feature vector calculation of the embedding vector network.
  • Fig. 7 shows the training method of the retrieval model provided by an exemplary embodiment of the present application. The method is described as applied to the computer device shown in Fig. 2, where step 320, step 340 and step 360 are all consistent with those shown in Fig. 3.
  • For step 381, training the quantization index network based on the first sample triplet set includes:
  • Step 381-1 input the first sample triplet set into the quantization index network to obtain n 1 third error losses;
  • the first sample triplet set includes n 1 sample triplets and is obtained by sorting the first error losses of the n groups of triplet feature vectors output by the embedding vector network 401. For details, refer to "for step 340" above.
  • calculating the n 1 third error losses includes:
  • for each sample triplet of the first sample triplet set, the second triplet loss of the triplet feature vector is calculated through the quantization index network, where the triplet feature vector is the feature vector output by the embedding vector network; for each sample triplet of the first sample triplet set, the second quantization error loss of the triplet feature vector is calculated through the quantization index network; the second triplet loss and the second quantization error loss are weighted and summed to obtain the third error loss.
  • the method for calculating n 1 third error losses is similar to the method for calculating n second error losses in step 362 above, and will not be repeated here.
  • Step 381-2 train the quantization index network based on n 1 third error losses.
  • the computer device trains the quantized indexing network based on n 1 third error losses.
  • For step 382, training the embedding vector network based on the second sample triplet set includes:
  • Step 382-1 input the second sample triplet set into the embedding vector network to obtain n 2 fourth error losses;
  • the second sample triplet set includes n 2 sample triplets and is obtained by sorting the second error losses of the n groups of triplet quantization indexes output by the quantization index network 402. For details, refer to "for step 360" above.
  • calculating the n 2 fourth error losses includes:
  • the fourth distance between the feature vector of the training sample and the feature vector of the negative sample is calculated through the embedding vector network
  • the method for calculating the n 2 fourth error losses is similar to the method for calculating the n first error losses in step 342 above, and will not be repeated here.
  • Step 382-2 based on n 2 fourth error losses, train the embedding vector network.
  • the computer device trains the embedding vector network based on n 2 fourth error losses.
  • using the first sample triplet set to train the quantized index network and the second sample triplet set to train the embedded vector network optimizes the quantization effect of the quantized index network and the eigenvector calculation effect of the embedded vector network .
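  • the exchange between the two branches can be sketched as follows: each branch ranks all sample triplets by its own loss, and the low-loss subset it selects is used to train the other branch. The helper name, the 75% selection range, and the loss values are made up for illustration; sample triplet 2 plays the role of a noise sample triplet on which the two branches disagree.

```python
import numpy as np

def lowest_loss_indices(losses, keep_percent):
    """Indices of the sample triplets in the lowest keep_percent of losses."""
    keep = int(len(losses) * keep_percent / 100)
    order = np.argsort(np.asarray(losses, dtype=float), kind="stable")
    return set(order[:keep].tolist())

# Made-up per-triplet losses from the two branches for n = 4 sample triplets.
embed_losses = [0.1, 0.2, 5.0, 0.3]   # first error losses (embedding vector network)
quant_losses = [0.2, 6.0, 0.1, 0.4]   # second error losses (quantization index network)

# Selected by the embedding vector network, used to train the quantization index network.
first_set = lowest_loss_indices(embed_losses, 75)
# Selected by the quantization index network, used to train the embedding vector network.
second_set = lowest_loss_indices(quant_losses, 75)
```

  • here sample triplet 2 is dropped from the first sample triplet set because the embedding vector network assigns it a high loss, even though the quantization index network would have kept it; sample triplet 1 is dropped from the second set for the symmetric reason.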
  • FIG. 8 shows a retrieval model training system provided by an exemplary embodiment of the present application, which adds a basic feature network 403 to the retrieval model training system shown in FIG. 4 .
  • Based on the training method of the retrieval model shown in FIG. 3, the method further includes, between step 320 and step 340: acquiring the basic feature vectors of the n sample triplets through the basic feature network.
  • the basic feature network 403 is used to extract the image basic features of the input sample triplet, and the basic features include but not limited to color features, texture features, shape features and spatial relationship features.
  • Fig. 9 shows a schematic diagram of a method for training a retrieval model provided by an exemplary embodiment of the present application. The method is applied to the computer device shown in Fig. 2 for illustration. The method includes:
  • Step 320 obtaining n sample triplets for training the retrieval model
  • a computer device acquires n sample triplets for training a retrieval model.
  • For a detailed description of step 320, refer to "for step 320" above.
  • Step 330 obtain the basic feature vectors of n sample triplets through the basic feature network
  • the computer device obtains basic feature vectors of n sample triplets.
  • Step 341 obtaining n sets of triplet feature vectors output by the embedding vector network for n sample triplets
  • the computer device obtains n sets of triplet feature vectors output by the embedding vector network for the n sample triplets.
  • Step 342 calculating n first error losses corresponding to n sets of triplet feature vectors
  • the computer device calculates n first error losses corresponding to n groups of triplet feature vectors.
  • Step 343 from the sorting result of the n first error losses from small to large, screen out the sample triplets corresponding to the n 1 first error losses that fall within the first selection range, and add them to the first sample triplet set used for training the quantization index network.
  • the computer device screens out the sample triplets corresponding to the n 1 first error losses that fall within the first selection range and adds them to the first sample triplet set used for training the quantization index network.
  • For detailed descriptions of step 341, step 342 and step 343, reference may be made to the embodiment shown in FIG. 5.
  • Step 381-1 input the first sample triplet set into the quantization index network to obtain n 1 third error losses;
  • the computer device inputs the first sample triplet set into the quantization index network to obtain n 1 third error losses.
  • calculating the n 1 third error losses includes:
  • the computer device For each sample triplet of the first sample triplet set, the computer device calculates a second triplet loss of the triplet feature vector; for each sample triplet of the first sample triplet set, The computer device calculates the second quantization error loss of the triplet feature vector; the computer device performs weighted summation on the second triplet loss and the second quantization error loss to obtain the third error loss.
  • Step 381-2 based on n 1 third error losses, train the quantization index network
  • the computer device trains the quantized indexing network based on n 1 third error losses.
  • Step 361 obtaining n groups of triplet quantization indexes output by the quantization index network for n sample triplets
  • the computer equipment inputs the basic feature vectors of n sample triplets into the quantization index network, and outputs n sets of triplet quantization indexes.
  • Step 362 calculating n second error losses corresponding to n sets of triplet quantization indices
  • the method for computer equipment to calculate n second error losses corresponding to n groups of triplet quantization indexes includes:
  • the computer device calculates the first triplet loss of the triplet quantization indexes; for each group of triplet quantization indexes, the computer device calculates the first quantization error loss of the triplet quantization indexes; the computer The device performs weighted summation on the first triplet loss and the first quantization error loss to obtain the second error loss.
  • Step 363 from the sorting result of the n second error losses from small to large, screen out the sample triplets corresponding to the n 2 second error losses that fall within the second selection range, and add them to the second sample triplet set used for training the embedding vector network;
  • the computer device selects, from the sorting result of the n second error losses from small to large, the sample triplets corresponding to the n 2 second error losses ranked in the top y%, and adds them to the second sample triplet set used for training the embedding vector network.
  • For detailed descriptions of step 361, step 362 and step 363, reference may be made to the embodiment shown in FIG. 6.
  • Step 382-1 input the second sample triplet set into the embedding vector network to obtain n 2 fourth error losses;
  • the computer device inputs the second set of sample triples into the embedding vector network to obtain n 2 fourth error losses.
  • Step 382-2 based on n 2 fourth error losses, train the embedding vector network.
  • the computer device trains the embedding vector network based on n 2 fourth error losses.
  • For detailed descriptions of step 381-1, step 381-2, step 382-1 and step 382-2, reference may be made to the embodiment shown in FIG. 7.
  • in summary, the embedding vector network and the quantization index network are trained with sample triplets: the first sample triplet set screened by the embedding vector network is used as the sample triplets for training the quantization index network, and the second sample triplet set screened by the quantization index network is used as the sample triplets for training the embedding vector network. The above method enables the embedding vector network to support calculating the distances between the feature vectors of the m candidate images and the feature vector of the query image, and also makes the quantization index of the query image obtained by the quantization index network more accurate.
  • the above training method of the retrieval model eliminates noise sample triplets while making the quantization index network and the embedding vector network predict positive and negative samples similarly: noise sample triplets are predicted through the two branches, and each branch learns from the sample triplets on which the other branch performs well, which realizes denoising learning and gives the two branches similar prediction effects.
  • the basic feature network is a ResNet101 network (a convolutional neural network), that is, the basic feature network can be trained using the ResNet101 network; the specific parameters are shown in Table 1 below, where Conv1-Conv5 use the parameters of ResNet101 pre-trained on the ImageNet data set (a large-scale open-source data set for general object recognition).
  • the basic feature network can also use a ResNet18 CNN.
  • for the embedding vector network, the parameters in Table 2 below are used for training.
  • the embedding vector network can also be called the Embedding network.
  • the embedding vector network is initialized with a Gaussian distribution with a variance of 0.01 and a mean of 0.
  • the embedding vector network can also use multiple fully connected (FC) layers; the embedding vector network outputs a 64-dimensional vector.
  • for the quantization index network, the parameters shown in Table 3 below are used for training.
  • the quantization index network is initialized with a Gaussian distribution with a variance of 0.01 and a mean of 0.
  • Setting learning parameters: when updating the retrieval model, the underlying basic features need to be updated while the embedding vector network is trained in the first stage; the learning parameters are set as shown in Table 1 and Table 2. When the quantization index network is trained in the second stage, the basic feature network and the embedding vector network do not need to be updated.
  • Setting the model parameter update: stochastic gradient descent (SGD) is used to back-propagate the gradient of the loss function value obtained for the previous batch, obtain the updated values of the parameters of the embedding vector network and the quantization index network, and update the networks.
  • implementing object retrieval includes the following steps: 1. obtaining the basic feature vector of the query object; 2. inputting the basic feature vector into the quantization index network and the embedding vector network; 3. obtaining the quantization index of the query object through the quantization index network, and obtaining the feature vector of the query object through the embedding vector network; 4. based on the quantization index, obtaining the feature vectors of m candidate objects by indexing the quantization codebook, where the quantization codebook stores the mapping relationship between the quantization index and the feature vectors of the m candidate objects, and m is a positive integer; 5.
  • Fig. 10 is a flowchart of an image retrieval method provided by an exemplary embodiment of the present application. In this embodiment, the method is executed by the computer device shown in FIG. 2 for illustration, and the method includes:
  • Step 1001 obtaining the basic feature vector of the query image
  • a query image is an image used for image retrieval.
  • the retrieval model further includes a basic feature network through which the basic feature vectors of n sample triplets are obtained.
  • FIG. 11 shows a schematic diagram of the system that uses the retrieval model, provided by an exemplary embodiment of the present application; the query image 1101 is input into the basic feature network 403 to obtain the basic feature vector of the query image 1101.
  • Step 1002 inputting the basic feature vector into the quantization index network and the embedding vector network;
  • the basic feature vector of the query image 1101 is input to the quantization index network 402 and the embedding vector network 401 .
  • Step 1003 obtain the quantization index of the query image through the quantization index network, and obtain the feature vector of the query image through the embedding vector network;
  • the quantization index (1, 0, 0) of the query image 1101 is obtained through the quantization index network 402, and the feature vector (0.2, 0.8, 0.3, 0.3) of the query image 1101 is obtained through the embedding vector network 401.
  • Step 1004 based on the quantization index, obtain the feature vectors of m candidate images by indexing the quantization codebook;
  • the quantization codebook stores the mapping relationship between quantization indices and feature vectors of m candidate images, where m is a positive integer.
  • the feature vectors of three candidate images are obtained by indexing the quantization codebook: (0.2, 0.7, 0.3, 0.3), (0.1, 0.5, 0.2, 0.2) and (0.2, 0.4, 0.2, 0.3).
  • Fig. 11 also shows the process of constructing the quantization codebook provided by an exemplary embodiment of the present application. All images in the image library are subjected to feature extraction to obtain quantization indexes and feature vectors.
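  • the quantization codebook is constructed by the following steps:
  • for each image i in the image library, the feature vector e is obtained by the embedding vector network 401 and the quantization index q is obtained by the quantization index network 402 (q is obtained through a sign function and is a vector in which each dimension is 0 or 1);
  • the mapping table T[i:e] of image i is recorded (where i represents the serial number of the image and e represents the feature vector output for image i by the embedding vector network);
  • the list Lindex of quantization indexes, the mapping table Linvert of q, and the mapping table T together form the quantization codebook.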
  • obtaining the feature vectors of the m candidate images by indexing the quantization codebook includes: calculating the quantization index of the query image; determining, from the list Lindex, the quantization indexes whose Hamming distance from the quantization index of the query image is less than a threshold; determining, from the mapping table Linvert, the m candidate images corresponding to these quantization indexes; and determining the feature vectors of the m candidate images through the mapping table T.
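  • the codebook construction and lookup described above can be sketched with plain Python containers: Lindex as a list of distinct quantization indexes, Linvert as a mapping from a quantization index to the serial numbers of the images that share it, and T as a mapping from serial number to feature vector. The function names and the fourth library image are hypothetical; the first three feature vectors and the query quantization index (1, 0, 0) are taken from the example in this embodiment.

```python
import numpy as np

def build_codebook(quant_indexes, feature_vectors):
    """Build (Lindex, Linvert, T) from per-image quantization indexes and features."""
    l_invert = {}   # quantization index q -> serial numbers of images with that q
    t_map = {}      # serial number i -> feature vector e
    for i, (q, e) in enumerate(zip(quant_indexes, feature_vectors)):
        key = tuple(int(bit) for bit in q)
        l_invert.setdefault(key, []).append(i)
        t_map[i] = np.asarray(e, dtype=float)
    l_index = list(l_invert)  # list of distinct quantization indexes
    return l_index, l_invert, t_map

def lookup(query_q, codebook, threshold):
    """Return candidates whose quantization index is within the Hamming threshold."""
    l_index, l_invert, t_map = codebook
    query_q = np.asarray(query_q)
    candidate_ids = []
    for key in l_index:
        hamming = int(np.count_nonzero(np.asarray(key) != query_q))
        if threshold > hamming:   # Hamming distance less than the threshold
            candidate_ids.extend(l_invert[key])
    return candidate_ids, [t_map[i] for i in candidate_ids]

library_q = [(1, 0, 0), (1, 0, 0), (1, 0, 0), (0, 1, 1)]
library_e = [(0.2, 0.7, 0.3, 0.3), (0.1, 0.5, 0.2, 0.2),
             (0.2, 0.4, 0.2, 0.3), (0.9, 0.1, 0.8, 0.7)]  # last entry is made up
codebook = build_codebook(library_q, library_e)
candidate_ids, candidate_vecs = lookup((1, 0, 0), codebook, threshold=1)
```

  • grouping images by quantization index means that only a few short Hamming comparisons are needed before any feature vector distance is computed, which is the purpose of the quantization index in retrieval.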
  • Step 1005 respectively calculating fifth distances between the feature vectors of the m candidate images and the feature vectors of the query image to obtain m fifth distances;
  • the fifth distance is the Euclidean distance.
  • Step 1006 among the sorting results of the m fifth distances from small to large, according to the preset z value, filter out the candidate images corresponding to the fifth distances ranked in the top z%.
  • the z value can be obtained by pre-configuring the retrieval model; optionally, the z value can be set reasonably according to the retrieval requirements, so that the screened candidate images meet the retrieval expectations of the retrieval model.
  • z is a positive number.
  • the computer device also sends the screened candidate images to a client running an image retrieval function.
  • with the retrieval model comprising the basic feature network, the embedding vector network and the quantization index network, the above method sorts the distances between the query image and the m candidate images and screens out the top-ranked images. The method not only makes the screened images closer to the query image, but also avoids losing candidate images or adding extra ones when the m candidate images are determined.
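  • steps 1005 and 1006 can be sketched as follows, using the query feature vector and the three candidate feature vectors from the example in this embodiment. The function name `top_z_percent` and the rule for turning z% into a count (round down, keep at least one) are assumptions, since the text does not specify the rounding.

```python
import numpy as np

def top_z_percent(query_vec, candidate_vecs, z):
    """Rank candidates by Euclidean distance to the query and keep the top z%."""
    diffs = np.asarray(candidate_vecs, dtype=float) - np.asarray(query_vec, dtype=float)
    fifth_distances = np.linalg.norm(diffs, axis=1)
    order = np.argsort(fifth_distances, kind="stable")  # small to large
    keep = max(1, int(len(order) * z / 100))            # assumed rounding rule
    return order[:keep], fifth_distances

query = (0.2, 0.8, 0.3, 0.3)
candidates = [(0.2, 0.7, 0.3, 0.3), (0.1, 0.5, 0.2, 0.2), (0.2, 0.4, 0.2, 0.3)]
kept, fifth_distances = top_z_percent(query, candidates, z=50)
```

  • the first candidate differs from the query in a single dimension by 0.1, so it has the smallest fifth distance and is the one kept when z is 50.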
  • Fig. 12 is a flowchart of an audio retrieval method provided by an exemplary embodiment of the present application.
  • the method is executed by the computer device shown in FIG. 2 for illustration, and the method includes:
  • Step 1201 obtain the basic feature vector of the query audio
  • the query audio is the audio used for audio retrieval.
  • the retrieval model further includes a basic feature network, through which basic feature vectors of n sample triplets are obtained.
  • Step 1202 input the basic feature vector to the quantization index network and the embedding vector network;
  • the basic feature vector of the query audio is input to the quantization index network and the embedding vector network.
  • Step 1203 obtain the quantization index of the query audio through the quantization index network, and obtain the feature vector of the query audio through the embedding vector network;
  • the quantization index (1, 0, 0) of the query audio is obtained through the quantization index network
  • the feature vector (0.2, 0.8, 0.3, 0.3) of the query audio is obtained through the embedding vector network.
  • Step 1204 based on the quantization index, obtain the feature vectors of m candidate audios by indexing the quantization codebook;
  • the quantization codebook stores the mapping relationship between the quantization index and feature vectors of m candidate audios, where m is a positive integer.
  • the feature vectors of three candidate audios are obtained by indexing the quantization codebook: (0.2, 0.7, 0.3, 0.3), (0.1, 0.5, 0.2, 0.2) and (0.2, 0.4, 0.2, 0.3).
  • the quantization codebook is constructed by the following steps:
  • for each audio i, the feature vector e is obtained by the embedding vector network 401 and the quantization index q is obtained by the quantization index network 402 (q is obtained through a sign function and is a vector in which each dimension is 0 or 1);
  • the mapping table T[i:e] of audio i is recorded (where i represents the serial number of the audio and e represents the feature vector output for audio i by the embedding vector network);
  • the list Lindex of quantization indexes, the mapping table Linvert of q, and the mapping table T together form the quantization codebook.
  • obtaining the feature vectors of the m candidate audios by indexing the quantization codebook includes: calculating the quantization index of the query audio; determining, from the list Lindex, the quantization indexes whose Hamming distance from the quantization index of the query audio is less than a threshold; determining, from the mapping table Linvert, the m candidate audios corresponding to these quantization indexes; and determining the feature vectors of the m candidate audios through the mapping table T.
  • Step 1205: calculate the fifth distance between the feature vector of each of the m candidate audios and the feature vector of the query audio, to obtain m fifth distances;
  • for example, the fifth distance is calculated between the feature vector (0.2, 0.8, 0.3, 0.3) of the query audio 1101 and each of the feature vectors of the three candidate audios, (0.2, 0.7, 0.3, 0.3), (0.1, 0.5, 0.2, 0.2) and (0.2, 0.4, 0.2, 0.3), resulting in 3 fifth distances.
  • the fifth distance is the Euclidean distance.
  • Step 1206: from the m fifth distances sorted in ascending order, select, according to a preset z value, the candidate audios corresponding to the fifth distances ranked in the top z%.
  • the candidate audios corresponding to the fifth distances ranked in the top z% are selected, where z% may be obtained by pre-configuring the retrieval model.
  • the z value can be set according to the retrieval requirements, so that the selected candidate audios meet the retrieval expectations of the retrieval model. z is a positive number.
  • the computer device also sends the selected candidate audios to the client running the audio retrieval function.
  • in summary, the above method uses a retrieval model that includes the basic feature network, the embedding vector network and the quantization index network to sort the distances between the query audio and the m candidate audios and select the top-ranked audios.
  • the above method not only brings the selected audios closer to the query audio, but also avoids missing candidate audios or adding extra ones when the m candidate audios are determined.
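The codebook construction and lookup described for step 1204 can be sketched as follows. This is a minimal illustration and not the applicant's implementation: the function names `build_codebook` and `lookup`, the four-item library, the quantization indexes assigned to each item, and the threshold value are all assumptions made for the example; only the names Lindex, Linvert and T come from the text.

```python
from collections import defaultdict

def hamming(a, b):
    """Number of positions at which two equal-length bit tuples differ."""
    return sum(x != y for x, y in zip(a, b))

def build_codebook(items):
    """items: iterable of (serial number i, quantization index q, feature vector e)."""
    T = {}                        # T[i] = e
    Linvert = defaultdict(list)   # Linvert[q] = serial numbers sharing index q
    for i, q, e in items:
        T[i] = e
        Linvert[tuple(q)].append(i)
    Lindex = list(Linvert)        # list of distinct quantization indexes
    return Lindex, Linvert, T

def lookup(query_q, codebook, threshold):
    """Feature vectors of items whose index is within `threshold` Hamming distance."""
    Lindex, Linvert, T = codebook
    near = [q for q in Lindex if hamming(q, tuple(query_q)) < threshold]
    return {i: T[i] for q in near for i in Linvert[q]}

# Hypothetical library: the index assigned to each item is invented for illustration.
library = [
    (0, (1, 0, 0), (0.2, 0.7, 0.3, 0.3)),
    (1, (1, 0, 0), (0.1, 0.5, 0.2, 0.2)),
    (2, (1, 0, 1), (0.2, 0.4, 0.2, 0.3)),
    (3, (0, 1, 1), (0.9, 0.1, 0.8, 0.7)),
]
codebook = build_codebook(library)
candidates = lookup((1, 0, 0), codebook, threshold=2)
```

With these assumed values, items 0, 1 and 2 are returned as candidates, while item 3 differs from the query index in all three bits and is skipped.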
  • Fig. 13 is a flowchart of a video retrieval method provided by an exemplary embodiment of the present application.
  • this embodiment is illustrated with the method being executed by the computer device shown in FIG. 2; the method includes:
  • Step 1301: obtain the basic feature vector of the query video;
  • the query video is the video used for video retrieval.
  • the retrieval model further includes a basic feature network, through which basic feature vectors of n sample triplets are obtained.
  • Step 1302: input the basic feature vector into the quantization index network and the embedding vector network;
  • the basic feature vector of the query video is input to the quantization index network and the embedding vector network.
  • Step 1303: obtain the quantization index of the query video through the quantization index network, and obtain the feature vector of the query video through the embedding vector network;
  • for example, the quantization index (1, 0, 0) of the query video is obtained through the quantization index network,
  • and the feature vector (0.2, 0.8, 0.3, 0.3) of the query video is obtained through the embedding vector network.
  • Step 1304: based on the quantization index, look up the feature vectors of m candidate videos in the quantization codebook;
  • the quantization codebook stores the mapping relationship between quantization indexes and the feature vectors of the m candidate videos, where m is a positive integer.
  • for example, the feature vectors of three candidate videos are looked up in the quantization codebook: (0.2, 0.7, 0.3, 0.3), (0.1, 0.5, 0.2, 0.2) and (0.2, 0.4, 0.2, 0.3).
  • the quantization codebook is constructed by the following steps:
  • for each video, the feature vector e is obtained through the embedding vector network 401;
  • the quantization index q is obtained through the quantization index network 402 (q is obtained by applying a sign function, so q is a vector in which each dimension is 0 or 1);
  • the list Lindex of quantization indexes, the mapping table Linvert of q, and the mapping table T together form the quantization codebook.
  • looking up the feature vectors of the m candidate videos in the quantization codebook includes: calculating the quantization index of the query video; determining, from the list Lindex, the quantization indexes whose Hamming distance to the quantization index of the query video is less than a threshold; determining, from the mapping table Linvert, the m candidate videos corresponding to those quantization indexes; and determining the feature vectors of the m candidate videos through the mapping table T.
  • Step 1305: calculate the fifth distance between the feature vector of each of the m candidate videos and the feature vector of the query video, to obtain m fifth distances;
  • for example, the fifth distance is calculated between the feature vector (0.2, 0.8, 0.3, 0.3) of the query video 1101 and each of the feature vectors of the three candidate videos, (0.2, 0.7, 0.3, 0.3), (0.1, 0.5, 0.2, 0.2) and (0.2, 0.4, 0.2, 0.3), resulting in 3 fifth distances.
  • the fifth distance is the Euclidean distance.
  • Step 1306: from the m fifth distances sorted in ascending order, select, according to a preset z value, the candidate videos corresponding to the fifth distances ranked in the top z%.
  • the candidate videos corresponding to the fifth distances ranked in the top z% are selected, where z% may be obtained by pre-configuring the retrieval model.
  • the z value can be set according to the retrieval requirements, so that the selected candidate videos meet the retrieval expectations of the retrieval model.
  • z is a positive number.
  • the computer device also sends the selected candidate videos to a client running a video retrieval function.
  • in summary, the above method uses a retrieval model that includes the basic feature network, the embedding vector network and the quantization index network to sort the distances between the query video and the m candidate videos and select the top-ranked videos.
  • the above method not only brings the selected videos closer to the query video, but also avoids missing candidate videos or adding extra ones when the m candidate videos are determined.
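Steps 1305 and 1306 amount to ranking the candidates by Euclidean distance and keeping the top z%. The sketch below uses the example vectors from the text; the function name `top_z_percent`, the candidate labels, the value z = 50 and the rounding rule (round up, keep at least one) are assumptions, since the text does not say how a fractional count is handled.

```python
import math

def euclidean(u, v):
    """Euclidean distance (the 'fifth distance') between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def top_z_percent(query, candidates, z):
    """Candidate ids sorted by ascending distance, truncated to the top z%."""
    ranked = sorted(candidates, key=lambda i: euclidean(query, candidates[i]))
    keep = max(1, math.ceil(len(ranked) * z / 100))  # assumed rounding rule
    return ranked[:keep]

query = (0.2, 0.8, 0.3, 0.3)
candidates = {
    "video_a": (0.2, 0.7, 0.3, 0.3),
    "video_b": (0.1, 0.5, 0.2, 0.2),
    "video_c": (0.2, 0.4, 0.2, 0.3),
}
selected = top_z_percent(query, candidates, z=50)
```

For these vectors the three distances are about 0.10, 0.35 and 0.41, so with z = 50 the first two candidates are kept.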
  • Fig. 14 is a structural block diagram of a training device for a retrieval model provided by an exemplary embodiment of the present application.
  • the retrieval model includes an embedding vector network and a quantization index network.
  • the embedding vector network is used to obtain feature vectors of images, and the quantization index network is used to extract quantization indexes of images.
  • the training device of the retrieval model includes:
  • the obtaining module 1401 is used to obtain n sample triplets for training the retrieval model; each sample triplet includes a training sample, a positive sample that forms a similar sample pair with the training sample, and a negative sample that does not form a similar sample pair with the training sample; n is a positive integer greater than 1;
  • the screening module 1402 is used to input the basic feature vectors of the n sample triplets into the embedding vector network, and to select, according to the error of the feature vectors output by the embedding vector network, the first sample triplet set for training the quantization index network;
  • the screening module 1402 is also used to input the basic feature vectors of the n sample triplets into the quantization index network, and to select, according to the error of the quantization indexes output by the quantization index network, the second sample triplet set for training the embedding vector network;
  • the training module 1403 is configured to train the quantization index network based on the first sample triplet set, and train the embedding vector network based on the second sample triplet set.
  • the screening module 1402 is further configured to obtain n sets of triplet feature vectors output by the embedding vector network for n sample triplets.
  • the screening module 1402 is further configured to calculate n first error losses corresponding to n sets of triplet feature vectors.
  • the screening module 1402 is further configured to, from the n first error losses sorted in ascending order, select the n1 first error losses ranked within a first selection range, and add the corresponding sample triplets to the first sample triplet set used to train the quantization index network.
  • the screening module 1402 is further configured to, for each set of triplet feature vectors, calculate a first distance between the feature vectors of the training samples and the feature vectors of the positive samples.
  • the screening module 1402 is further configured to, for each set of triplet feature vectors, calculate a second distance between the feature vectors of the training samples and the feature vectors of the negative samples.
  • the screening module 1402 is also used to calculate the first error loss from the difference between the first distance and the second distance and a first distance threshold; the first distance threshold is the threshold on the difference between the distance from the training sample to the positive sample and the distance from the training sample to the negative sample.
  • the screening module 1402 is also used to, from the n first error losses sorted in ascending order, select, according to a preset x value, the n1 first error losses ranked in the top x%, and add the corresponding sample triplets to the first sample triplet set used for training the quantization index network; n1 is a positive integer smaller than n.
  • x is a positive number.
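The first error loss and the top x% selection can be illustrated with a short sketch. It assumes the common margin-based triplet form max(d_pos - d_neg + margin, 0), with the margin playing the role of the first distance threshold; the text does not spell out the exact formula, and the two-dimensional embeddings, the margin of 0.5 and x = 75 are invented for the example.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def first_error_loss(anchor, positive, negative, margin):
    """Assumed triplet form: penalize when the positive is not closer by `margin`."""
    d_pos = euclidean(anchor, positive)   # first distance
    d_neg = euclidean(anchor, negative)   # second distance
    return max(d_pos - d_neg + margin, 0.0)

def select_low_loss(losses, percent):
    """Indices of the triplets whose losses rank in the smallest `percent`%."""
    order = sorted(range(len(losses)), key=lambda k: losses[k])
    keep = max(1, int(len(losses) * percent / 100))
    return order[:keep]

# Hypothetical embedding-network outputs: (training sample, positive, negative).
triplets = [
    ((0.0, 0.0), (0.0, 0.1), (1.0, 0.0)),  # clean: positive near, negative far
    ((0.0, 0.0), (0.5, 0.0), (0.6, 0.0)),
    ((0.0, 0.0), (1.0, 0.0), (0.1, 0.0)),  # likely noisy: "positive" is far away
    ((0.0, 0.0), (0.3, 0.0), (0.5, 0.0)),
]
losses = [first_error_loss(a, p, n, margin=0.5) for a, p, n in triplets]
first_set = select_low_loss(losses, percent=75)
```

Here the third triplet has the largest loss and is left out of the first sample triplet set, which is the behaviour the screening step relies on to discard noisy triplets.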
  • the screening module 1402 is further configured to obtain n groups of triplet quantization indexes output by the quantization index network for n sample triplets.
  • the screening module 1402 is further configured to calculate n second error losses corresponding to n sets of triplet quantization indices.
  • the screening module 1402 is further configured to, from the n second error losses sorted in ascending order, select the n2 second error losses ranked within a second selection range, and add the corresponding sample triplets to the second sample triplet set used for training the embedding vector network; n2 is a positive integer smaller than n.
  • the screening module 1402 is further configured to calculate the first triplet loss of triplet quantization indices for each set of triplet quantization indices.
  • the screening module 1402 is further configured to calculate the first quantization error loss of triplet quantization indexes for each group of triplet quantization indexes.
  • the screening module 1402 is further configured to perform weighted summation of the first triplet loss and the first quantization error loss to obtain the second error loss.
  • the screening module 1402 is also used to, from the n second error losses sorted in ascending order, select, according to a preset y value, the n2 second error losses ranked in the top y%, and add the corresponding sample triplets to the second sample triplet set used for training the embedding vector network.
  • y is a positive number.
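The second error loss is described as a weighted sum of a triplet loss and a quantization error loss computed on the quantization index network's outputs. The sketch below is one possible reading under stated assumptions: the network is taken to emit real values in [-1, 1] before the sign function, the quantization error is taken to be the mean squared gap to the nearest of -1 and +1, and the weights and margin are hypothetical.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin):
    return max(euclidean(anchor, positive) - euclidean(anchor, negative) + margin, 0.0)

def quantization_error(u):
    """Mean squared gap between each output and its sign target (-1 or +1)."""
    return sum((x - (1.0 if x >= 0 else -1.0)) ** 2 for x in u) / len(u)

def second_error_loss(anchor, positive, negative, margin, w_triplet=1.0, w_quant=0.5):
    """Weighted sum of the triplet loss and the averaged quantization error."""
    l_triplet = triplet_loss(anchor, positive, negative, margin)
    l_quant = sum(quantization_error(u) for u in (anchor, positive, negative)) / 3
    return w_triplet * l_triplet + w_quant * l_quant

# Outputs already at +/-1 with a well-separated negative: both terms vanish.
clean = second_error_loss((1, -1, 1), (1, -1, 1), (-1, 1, -1), margin=1.0)
# Outputs far from +/-1 with a distant "positive": both terms contribute.
noisy = second_error_loss((0.2, -0.1), (-0.9, 0.8), (0.3, -0.2), margin=1.0)
```

Sorting such losses in ascending order and keeping the top y% then yields the second sample triplet set.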
  • the training module 1403 is further configured to input the first sample triplet set into the quantization index network to obtain n1 third error losses.
  • the training module 1403 is further configured to train the quantization index network based on the n1 third error losses.
  • the training module 1403 is further configured to, for each sample triplet in the first sample triplet set, calculate a second triplet loss of the triplet feature vector.
  • the training module 1403 is further configured to calculate a second quantization error loss of triplet feature vectors for each sample triplet in the first sample triplet set.
  • the training module 1403 is further configured to perform weighted summation of the second triplet loss and the second quantization error loss to obtain a third error loss.
  • the training module 1403 is further configured to input the second sample triplet set into the embedding vector network to obtain n2 fourth error losses.
  • the training module 1403 is further configured to train the embedding vector network based on the n2 fourth error losses.
  • the training module 1403 is further configured to, for each sample triplet in the second sample triplet set, calculate a third distance between the feature vector of the training sample and the feature vector of the positive sample.
  • the training module 1403 is further configured to, for each sample triplet in the second sample triplet set, calculate a fourth distance between the feature vector of the training sample and the feature vector of the negative sample.
  • the training module 1403 is also used to calculate the fourth error loss from the difference between the third distance and the fourth distance and a second distance threshold; the second distance threshold is the threshold on the difference between the distance from the training sample to the positive sample and the distance from the training sample to the negative sample.
  • the retrieval model also includes a basic feature network.
  • the obtaining module 1401 is further configured to obtain basic feature vectors of n sample triplets through a basic feature network.
  • in summary, the above training device trains the embedding vector network and the quantization index network with sample triplets: the first sample triplet set selected through the embedding vector network is used as the sample triplets for training the quantization index network, and the second sample triplet set selected through the quantization index network is used as the sample triplets for training the embedding vector network.
  • the above training device thereby eliminates the influence of noisy sample triplets, and makes the quantization index network and the embedding vector network predict positive and negative samples similarly.
  • noisy sample triplets are predicted by the two branches, and the two branches learn from each other's well-performing sample triplets to achieve denoised learning, so that the two branches have similar prediction performance.
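The exchange between the two branches can be summarized in a few lines: each branch ranks the triplets by its own loss, and the low-loss subset it selects is handed to the other branch for training. The loss values, the percentages and the function names below are hypothetical; the sketch only shows the direction of the exchange, not the training itself.

```python
def smallest(losses, percent):
    """Indices of the `percent`% smallest losses, in ascending loss order."""
    order = sorted(range(len(losses)), key=losses.__getitem__)
    return order[:max(1, int(len(losses) * percent / 100))]

def cross_select(embedding_losses, quantization_losses, x, y):
    # Selected by the embedding vector network, used to train the quantization index network.
    first_set = smallest(embedding_losses, x)
    # Selected by the quantization index network, used to train the embedding vector network.
    second_set = smallest(quantization_losses, y)
    return first_set, second_set

embedding_losses = [0.0, 0.4, 1.4, 0.3]      # hypothetical first error losses
quantization_losses = [0.1, 0.2, 0.9, 1.1]   # hypothetical second error losses
first_set, second_set = cross_select(embedding_losses, quantization_losses, x=75, y=75)
```

In this example the embedding branch rejects triplet 2 and the quantization branch rejects triplet 3, so each branch is trained on a subset filtered by the other.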
  • Fig. 15 shows a structural block diagram of an object retrieval device provided by an exemplary embodiment of the present application, the device includes:
  • the obtaining module 1501 is configured to obtain the basic feature vector of the query object;
  • the input module 1502 is used to input the basic feature vector into the quantization index network and the embedding vector network;
  • the obtaining module 1501 is also used to obtain the quantization index of the query object through the quantization index network, and obtain the feature vector of the query object through the embedding vector network;
  • the indexing module 1503 is configured to look up, based on the quantization index, the feature vectors of m candidate objects in the quantization codebook; the quantization codebook stores the mapping relationship between quantization indexes and the feature vectors of the m candidate objects, and m is a positive integer;
  • a calculation module 1504 configured to calculate the fifth distances between the feature vectors of the m candidate objects and the feature vectors of the query object, to obtain m fifth distances;
  • the screening module 1505 is configured to, among the sorting results of the m fifth distances in ascending order, select the candidate objects corresponding to the fifth distances in the top z% according to the preset z value.
  • z is a positive number.
  • the obtaining module 1501 is further configured to generate the basic feature vector of the query image through the basic feature network of the retrieval model.
  • in summary, the above retrieval device uses a retrieval model that includes the basic feature network, the embedding vector network and the quantization index network to sort the distances between the query object and the m candidate objects and select the top-ranked objects.
  • the device not only brings the selected objects closer to the query object, but also avoids missing candidate objects or adding extra ones when the m candidate objects are determined.
  • it should be noted that the training device for the retrieval model and the object retrieval device provided in the above embodiments are illustrated only by the division of the above functional modules. In practical applications, the above functions can be allocated to different functional modules as needed, that is, the internal structure of the device is divided into different functional modules to complete all or part of the functions described above.
  • the training device for the retrieval model provided by the above embodiments belongs to the same concept as the embodiments of the training method for the retrieval model, and the image retrieval device belongs to the same concept as the image retrieval method. For the specific implementation process, refer to the method embodiments; details are not repeated here.
  • Fig. 16 shows a structural block diagram of a computer device 1600 provided by an exemplary embodiment of the present application.
  • the computer device can be a terminal or a server. In this embodiment, the retrieval model may be trained and/or used by the terminal alone, trained and/or used by the server alone, or jointly trained and/or jointly used by the terminal and the server.
  • the computer device 1600 includes: a processor 1601 and a memory 1602 .
  • the processor 1601 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and the like.
  • the processor 1601 can be implemented in at least one hardware form of DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array) and PLA (Programmable Logic Array).
  • the processor 1601 may also include a main processor and a coprocessor; the main processor is a processor for processing data in the wake-up state, also called a CPU (Central Processing Unit); the coprocessor is a low-power processor for processing data in the standby state.
  • the processor 1601 may be integrated with a GPU (Graphics Processing Unit), which is used for rendering and drawing the content that needs to be displayed on the display screen.
  • the processor 1601 may also include an AI processor, which is used to process computing operations related to machine learning.
  • Memory 1602 may include one or more computer-readable storage media, which may be non-transitory.
  • the memory 1602 may also include high-speed random access memory and non-volatile memory, such as one or more magnetic disk storage devices and flash memory storage devices.
  • the non-transitory computer-readable storage medium in the memory 1602 is used to store at least one instruction, and the at least one instruction is executed by the processor 1601 to implement the training method for the retrieval model or the image retrieval method provided by the method embodiments of this application.
  • the computer device 1600 may optionally further include: a peripheral device interface 1603 and at least one peripheral device.
  • the processor 1601, the memory 1602, and the peripheral device interface 1603 may be connected through buses or signal lines.
  • Each peripheral device can be connected to the peripheral device interface 1603 through a bus, a signal line or a circuit board.
  • the peripheral device includes: at least one of a radio frequency circuit 1604 , a display screen 1605 , a camera component 1606 , an audio circuit 1607 , a positioning component 1608 and a power supply 1609 .
  • the peripheral device interface 1603 may be used to connect at least one peripheral device related to I/O (Input/Output) to the processor 1601 and the memory 1602.
  • in some embodiments, the processor 1601, the memory 1602 and the peripheral device interface 1603 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 1601, the memory 1602 and the peripheral device interface 1603 can be implemented on a separate chip or circuit board, which is not limited in this embodiment.
  • the radio frequency circuit 1604 is used to receive and transmit RF (Radio Frequency) signals, also called electromagnetic signals.
  • the display screen 1605 is used to display a UI (User Interface).
  • the camera assembly 1606 is used to capture images or videos.
  • Audio circuitry 1607 may include a microphone and speakers.
  • the power supply 1609 is used to supply power to various components in the computer device 1600 .
  • the computer device 1600 also includes one or more sensors 1610.
  • the one or more sensors 1610 include, but are not limited to: an acceleration sensor 1611, a gyroscope sensor 1612, a pressure sensor 1613, a fingerprint sensor 1614, an optical sensor 1615 and a proximity sensor 1616.
  • the structure shown in FIG. 16 does not constitute a limitation on the computer device 1600, which may include more or fewer components than shown in the figure, combine certain components, or adopt a different component arrangement.
  • the present application also provides a computer-readable storage medium, wherein at least one instruction, at least one program, a code set or an instruction set is stored in the storage medium, and the at least one instruction, the at least one program, the code set or the instruction set is loaded and executed by the processor to implement the retrieval model training method or the object retrieval method provided by the above method embodiments.
  • the present application provides a computer program product or computer program, the computer program product or computer program comprising computer instructions stored in a computer-readable storage medium.
  • the processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the retrieval model training method or the object retrieval method provided by the above method embodiments.
  • the above sample triplet, the basic feature vector output by the basic feature network, the feature vector output by the embedding vector network, and the quantization index output by the quantization index network can be stored in the nodes of the data sharing system.
  • referring to the data sharing system shown in Figure 17, the data sharing system 1700 is a system for sharing data between nodes; it may include multiple nodes 1701, which may be the clients in the data sharing system.
  • Each node 1701 can receive input information during normal operation, and maintain the shared data in the data sharing system based on the received input information.
  • each node in the data sharing system has a corresponding node ID, and each node can store the node IDs of the other nodes in the data sharing system, so that it can later broadcast generated blocks to the other nodes according to their node IDs.
  • Each node can maintain a node ID list as shown in the following table, and store the node name and node ID in the node ID list.
  • the node ID can be an IP (Internet Protocol) address or any other information that can be used to identify the node. In Table 4, only the IP address is used as an example.
  • Every node in the data sharing system stores an identical blockchain.
  • the blockchain is composed of multiple blocks; see Figure 18.
  • the genesis block includes a block header and a block body.
  • the block header stores the feature value of the input information, a version number and a timestamp.
  • the input information is stored in the block body; the block following the genesis block takes the genesis block as its parent block and likewise includes a block header and a block body, and its block header stores data of the current block, so that the block data stored in the blocks is associated, which ensures the security of the input information in the blocks.
  • SHA256 is the feature value algorithm used to calculate the feature value;
  • version (version number) is the version information of the relevant block protocol in the blockchain;
  • prev_hash is the block header feature value of the parent block of the current block;
  • merkle_root is the feature value of the input information;
  • ntime is the update time of the timestamp;
  • nbits is the current difficulty, which stays fixed for a period of time and is determined again after that fixed period;
  • x is a random number;
  • TARGET is the feature value threshold, which can be determined according to nbits.
  • the node where the blockchain is located sends the newly generated block to the other nodes in its data sharing system; the other nodes verify the newly generated block and, after verification is complete, add it to the blockchain they store.
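The header fields listed above are the usual inputs of a proof-of-work check. The sketch below assumes the conventional form, in which a double SHA-256 over the concatenated fields must fall below TARGET; the formula itself does not appear in this passage, so that form, the string concatenation, the sequential search over x (the text only says x is a random number), and every field value used here are assumptions for illustration.

```python
import hashlib

def block_feature_value(version, prev_hash, merkle_root, ntime, nbits, x):
    """Double SHA-256 over the concatenated header fields, returned as an integer."""
    header = f"{version}{prev_hash}{merkle_root}{ntime}{nbits}{x}".encode("utf-8")
    digest = hashlib.sha256(hashlib.sha256(header).digest()).digest()
    return int.from_bytes(digest, "big")

def find_x(version, prev_hash, merkle_root, ntime, nbits, target, max_tries=1_000_000):
    """Try values of x until the feature value falls below the threshold."""
    for x in range(max_tries):
        if block_feature_value(version, prev_hash, merkle_root, ntime, nbits, x) < target:
            return x
    return None

# Hypothetical values: an easy threshold that roughly 1 in 256 attempts satisfies.
TARGET = 1 << 248
fields = (1, "00" * 32, "ab" * 32, 1700000000, 8)
x = find_x(*fields, target=TARGET)
```

A node receiving the block can verify it by recomputing the feature value once with the supplied x and comparing it with TARGET.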

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

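This application discloses a training method and a retrieval method for a retrieval model, and an apparatus, a device and a medium, belonging to the field of artificial intelligence. The retrieval model includes an embedding vector network and a quantization index network. The method includes: obtaining n sample triplets for training the retrieval model (320); inputting the basic feature vectors of the n sample triplets into the embedding vector network; selecting, according to the error of the feature vectors output by the embedding vector network, a first sample triplet set for training the quantization index network (340); inputting the basic feature vectors of the n sample triplets into the quantization index network; selecting, according to the error of the quantization indexes output by the quantization index network, a second sample triplet set for training the embedding vector network (360); training the quantization index network based on the first sample triplet set (381); and training the embedding vector network based on the second sample triplet set (382). The above method improves the accuracy of the retrieval model.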

Description

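Training and retrieval method, apparatus, device and medium for a retrieval model
This application claims priority to Chinese Patent Application No. 202110945262.5, filed on August 17, 2021 and entitled "Training and retrieval method, apparatus, device and medium for an image retrieval model", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of artificial intelligence, and in particular to a training method and a retrieval method for a retrieval model, and an apparatus, a device and a medium.
Background
When image retrieval is performed based on a query image, multiple candidate images are usually recalled and ranked, and the candidate images with higher confidence are selected as the retrieval result.
In the related art, the feature vector of the query image is usually first obtained through an embedding vector network, and PQ quantization (Product Quantization) is performed on the feature vector to obtain a quantization index; m feature vectors matching the quantization index are then found in the quantization codebook, and the m candidate images corresponding to the m feature vectors are recalled; finally, according to the ranking of the distances between the feature vectors of the m candidate images and the feature vector of the query image, the higher-ranked candidate images are selected as the finally recalled images.
In the related art, PQ quantization divides the value of each dimension of the feature vector into multiple segments, and each segment is represented by a different code (for example, if a dimension is a floating-point number between 0 and 1, it may be divided into 10 segments, 0.1, … 0.9, 1.0, with the numbers 1 to 10 representing the segments); during retrieval, the candidate images quantized to the same segment are recalled. However, PQ quantization tends to split similar features into two adjacent segments, so samples near a segment boundary are easily missed or over-recalled (for example, a sample on the boundary between segment 1 and segment 2 may be similar to samples in either segment; recalling only one segment misses some results, while recalling both segments increases false recalls).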
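The boundary problem described above can be made concrete with a small numeric sketch. It assumes the ten-segment split of the interval from 0 to 1 given as the example, with segment k covering values up to k/10; the function name `segment` and the three sample values are chosen only for illustration.

```python
import math

def segment(value, n_segments=10):
    """Segment number (1..n_segments) for a value in (0, 1]."""
    return min(n_segments, max(1, math.ceil(value * n_segments)))

a, b, c = 0.11, 0.19, 0.21
same_segment_but_far = segment(a) == segment(b)   # values differ by 0.08
split_but_close = segment(b) != segment(c)        # values differ by only 0.02
```

The values 0.19 and 0.21 are closer to each other than 0.11 and 0.19 are, yet only the latter pair shares a segment, so recalling a single segment misses the nearer neighbour.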
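Summary
This application provides a training method and a retrieval method for a retrieval model, and an apparatus, a device and a medium, which can improve retrieval accuracy. The technical solutions are as follows:
According to one aspect of this application, a training method for a retrieval model is provided. The method is executed by a computer device. The retrieval model includes an embedding vector network and a quantization index network; the embedding vector network is used to obtain the feature vector of a retrieval object, and the quantization index network is used to extract the quantization index of the retrieval object. The method includes:
obtaining n sample triplets for training the retrieval model; each sample triplet includes a training sample, a positive sample that forms a similar sample pair with the training sample, and a negative sample that does not form a similar sample pair with the training sample; n is a positive integer greater than 1;
inputting the basic feature vectors of the n sample triplets into the embedding vector network; selecting, according to the error of the feature vectors output by the embedding vector network, a first sample triplet set for training the quantization index network;
inputting the basic feature vectors of the n sample triplets into the quantization index network; selecting, according to the error of the quantization indexes output by the quantization index network, a second sample triplet set for training the embedding vector network;
training the quantization index network based on the first sample triplet set, and training the embedding vector network based on the second sample triplet set.
According to another aspect of this application, an object retrieval method is provided. The method is executed by a computer device and includes:
obtaining the basic feature vector of a query object;
inputting the basic feature vector into the quantization index network and the embedding vector network;
obtaining the quantization index of the query object through the quantization index network, and obtaining the feature vector of the query object through the embedding vector network;
looking up, based on the quantization index, the feature vectors of m candidate objects in the quantization codebook; the quantization codebook stores the mapping relationship between quantization indexes and the feature vectors of the m candidate objects; m is a positive integer;
calculating the fifth distance between the feature vector of each of the m candidate objects and the feature vector of the query object, to obtain m fifth distances;
selecting, from the m fifth distances sorted in ascending order, the candidate objects corresponding to the fifth distances ranked in the top z%.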
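According to another aspect of this application, a training device for a retrieval model is provided. The retrieval model includes an embedding vector network and a quantization index network; the embedding vector network is used to obtain the feature vector of a retrieval object, and the quantization index network is used to extract the quantization index of the retrieval object. The device includes:
an obtaining module, configured to obtain n sample triplets for training the retrieval model; each sample triplet includes a training sample, a positive sample that forms a similar sample pair with the training sample, and a negative sample that does not form a similar sample pair with the training sample;
a screening module, configured to input the basic feature vectors of the n sample triplets into the embedding vector network, and to select, according to the feature vectors output by the embedding vector network, a first sample triplet set for training the quantization index network;
the screening module being further configured to input the basic feature vectors of the n sample triplets into the quantization index network, and to select, according to the quantization indexes output by the quantization index network, a second sample triplet set for training the embedding vector network;
a training module, configured to train the quantization index network based on the first sample triplet set, and to train the embedding vector network based on the second sample triplet set.
According to another aspect of this application, an object retrieval device is provided. The device includes:
an obtaining module, configured to obtain the basic feature vector of a query object;
an input module, configured to input the basic feature vector into the quantization index network and the embedding vector network;
the obtaining module being further configured to obtain the quantization index of the query object through the quantization index network, and to obtain the feature vector of the query object through the embedding vector network;
an indexing module, configured to look up, based on the quantization index, the feature vectors of m candidate objects in the quantization codebook; the quantization codebook stores the mapping relationship between quantization indexes and the feature vectors of the m candidate objects;
a calculation module, configured to calculate the fifth distance between the feature vector of each of the m candidate objects and the feature vector of the query object, to obtain m fifth distances;
a screening module, configured to select, from the m fifth distances sorted in ascending order, the candidate objects corresponding to the fifth distances ranked in the top z%.
According to one aspect of this application, a computer device is provided. The computer device includes a processor and a memory; the memory stores a computer program, and the computer program is loaded and executed by the processor to implement the training method for the retrieval model and/or the object retrieval method described above.
According to another aspect of this application, a computer-readable storage medium is provided. The storage medium stores a computer program, and the computer program is loaded and executed by a processor to implement the training method for the retrieval model and/or the object retrieval method described above.
According to another aspect of this application, a computer program product or computer program is provided. The computer program product or computer program includes computer instructions stored in a computer-readable storage medium. The processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, so that the computer device executes the training method for the retrieval model and/or the object retrieval method described above.
The beneficial effects of the technical solutions provided by the embodiments of this application include at least the following:
The embedding vector network and the quantization index network are trained with sample triplets: the first sample triplet set selected through the embedding vector network is used as the sample triplets for training the quantization index network, and the second sample triplet set selected through the quantization index network is used as the sample triplets for training the embedding vector network. Because different networks differ in their ability to recognize noise, the embedding vector network and the quantization index network can each remove noise from the sample triplets through the selection process. Training the quantization index network with the first sample triplet set lets it learn the noise recognition ability of the embedding vector network, and training the embedding vector network with the second sample triplet set lets it learn the noise recognition ability of the quantization index network, so that the noise recognition abilities of both networks improve together. The above training method therefore eliminates the influence of noisy sample triplets and makes the quantization index network and the embedding vector network predict positive and negative samples similarly: noisy sample triplets are predicted by the two branches, and the two branches learn from each other's well-performing sample triplets to achieve denoised learning, so that the two branches have similar prediction performance.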
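Brief Description of the Drawings
FIG. 1 is a schematic diagram of PQ quantization provided by an exemplary embodiment of this application;
FIG. 2 is a schematic diagram of the implementation environment of a retrieval model according to an exemplary embodiment of this application;
FIG. 3 is a flowchart of a training method for a retrieval model according to an exemplary embodiment of this application;
FIG. 4 is a schematic diagram of a training system for a retrieval model according to an exemplary embodiment of this application;
FIG. 5 is a flowchart of a training method for a retrieval model according to another exemplary embodiment of this application;
FIG. 6 is a flowchart of a training method for a retrieval model according to another exemplary embodiment of this application;
FIG. 7 is a flowchart of a training method for a retrieval model according to another exemplary embodiment of this application;
FIG. 8 is a schematic diagram of a training system for a retrieval model according to another exemplary embodiment of this application;
FIG. 9 is a flowchart of a training method for a retrieval model according to another exemplary embodiment of this application;
FIG. 10 is a flowchart of an image retrieval method provided by an exemplary embodiment of this application;
FIG. 11 is a schematic diagram of a system for using a retrieval model provided by an exemplary embodiment of this application;
FIG. 12 is a flowchart of an audio retrieval method provided by an exemplary embodiment of this application;
FIG. 13 is a flowchart of a video retrieval method provided by an exemplary embodiment of this application;
FIG. 14 is a structural block diagram of a training device for a retrieval model provided by an exemplary embodiment of this application;
FIG. 15 is a structural block diagram of an object retrieval device provided by an exemplary embodiment of this application;
FIG. 16 is a structural block diagram of a computer device provided by an exemplary embodiment of this application;
FIG. 17 is a schematic diagram of a data sharing system provided by an exemplary embodiment of this application;
FIG. 18 is a schematic diagram of a blockchain structure provided by an exemplary embodiment of this application;
FIG. 19 is a schematic diagram of a new block generation process provided by an exemplary embodiment of this application.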
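Detailed Description
First, the terms involved in the embodiments of this application are briefly introduced:
PQ quantization: in the training stage, referring to FIG. 1, for N training samples, assume the sample dimension is 128 and that each sample is divided into 4 subspaces, so that each subspace has 32 dimensions. In each subspace, the feature vectors are clustered with k-means (a clustering algorithm) (FIG. 1 shows 256 cluster centers), so that a codebook is obtained for each subspace. Each sub-segment of each training sample can be approximated by a cluster center of its subspace, and the corresponding code is the number of that cluster center. The N training samples encoded in this way form an index codebook. A sample to be quantized is divided in the same way; in each subspace the cluster center nearest to the sub-segment is found, and each sub-segment is then represented by the number of its cluster center, which completes the index vector of the sample to be quantized.
In the retrieval stage, the feature vector of the query image is divided into 4 sub-segments; in each subspace, the distances from the sub-segment to all cluster centers of that subspace are calculated, giving 4*256 distances, which are used as a distance table. To calculate the distance from a sample to the query vector, for example the sample whose code is (124, 56, 132, 222), the distance corresponding to each sub-segment is simply taken from the distance table; for example, for the sub-segment coded 124, the distance numbered 124 is taken from the first group of 256 distances. Once the distances for all sub-segments have been taken, they are summed to obtain the asymmetric distance between the sample and the query sample. After all distances have been calculated, the samples with the top-ranked distances are recalled.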
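The distance table lookup described above can be sketched on a toy scale. Instead of 128 dimensions, 4 subspaces and 256 cluster centers, the example uses 4 dimensions, 2 subspaces and 2 centers per subspace; the codebooks are written by hand rather than learned with k-means, squared Euclidean distance is assumed for the table entries, and all function names are hypothetical.

```python
def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def split(vec, n_sub):
    """Cut a vector into n_sub equal-length sub-segments."""
    size = len(vec) // n_sub
    return [vec[k * size:(k + 1) * size] for k in range(n_sub)]

def encode(vec, codebooks):
    """Replace each sub-segment by the number of its nearest cluster center."""
    return tuple(
        min(range(len(centers)), key=lambda c: sq_dist(seg, centers[c]))
        for seg, centers in zip(split(vec, len(codebooks)), codebooks)
    )

def distance_table(query, codebooks):
    """table[s][c] = distance from query sub-segment s to cluster center c."""
    return [[sq_dist(seg, c) for c in centers]
            for seg, centers in zip(split(query, len(codebooks)), codebooks)]

def asymmetric_distance(code, table):
    """Sum the table entries selected by the sample's code."""
    return sum(table[s][c] for s, c in enumerate(code))

codebooks = [
    [(0.0, 0.0), (1.0, 1.0)],   # cluster centers of subspace 0
    [(0.0, 1.0), (1.0, 0.0)],   # cluster centers of subspace 1
]
code = encode((0.9, 1.1, 0.1, 0.8), codebooks)
table = distance_table((1.0, 1.0, 0.0, 0.0), codebooks)
d = asymmetric_distance(code, table)
```

The stored sample is reduced to the code (1, 0), and its distance to the query is read from the table without touching the sample's original vector, which is what makes the distance asymmetric.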
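In this application, considering that PQ quantization is slow to compute and tends to split similar basic features between two adjacent cluster centers, a deep sign quantization network is constructed instead.
Deep sign quantization network (quantization index network): first, for the D-dimensional feature vectors of N training samples, each dimension after vector normalization is a floating-point number in the range -1 to 1; compressing the D-dimensional feature vector into a binary code of a specified number of bits, each taking the value 0 or 1, is sign quantization. For example, normalizing a 4-dimensional feature vector gives (-1, 1, 0.5, -0.2), and sign quantization gives the quantization index (0, 1, 1, 0).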
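The sign quantization example above can be reproduced in a few lines. The function name `sign_quantize` is hypothetical, and mapping an exact zero to 0 is an assumption, since the text does not say how zero is treated.

```python
def sign_quantize(vec):
    """Map each normalized value to 1 if it is positive, otherwise 0."""
    return tuple(1 if x > 0 else 0 for x in vec)

q = sign_quantize((-1, 1, 0.5, -0.2))
```

For the normalized vector in the text this gives (0, 1, 1, 0), so nearby vectors that share signs share a code and can be compared by Hamming distance.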
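Computer vision (CV): computer vision is the science of how to make machines "see"; more specifically, it refers to using cameras and computers instead of human eyes to recognize, track and measure targets, and to further process the resulting images so that they are more suitable for human observation or for transmission to instruments for inspection. As a scientific discipline, computer vision studies related theories and technologies in an attempt to build artificial intelligence systems that can obtain information from images or multidimensional data. Computer vision technologies usually include image processing, image recognition, image semantic understanding, image retrieval, OCR (Optical Character Recognition), video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, and simultaneous localization and mapping, as well as common biometric recognition technologies such as face recognition and fingerprint recognition.
Artificial intelligence cloud service: the so-called artificial intelligence cloud service is also generally called AIaaS (AI as a Service). This is currently a mainstream service mode for artificial intelligence platforms. Specifically, an AIaaS platform splits several common types of AI (Artificial Intelligence) services and provides them in the cloud, either independently or as packages. This service mode is similar to opening an AI-themed mall: all developers can access one or more of the artificial intelligence services provided by the platform through an API (Application Program Interface), and some experienced developers can also use the AI framework and AI infrastructure provided by the platform to deploy, operate and maintain their own dedicated cloud artificial intelligence services.
It should be noted that the information (including but not limited to user equipment information and user personal information), data (including but not limited to data for analysis, stored data and displayed data) and signals involved in this application are all authorized by the user or fully authorized by all parties, and the collection, use and processing of the relevant data must comply with the relevant laws, regulations and standards of the relevant countries and regions. For example, the retrieval objects involved in this application are all obtained with full authorization.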
本申请实施例的方案包括检索模型的训练阶段和检索阶段,图2示出了本申请一个示例性实施例提供的检索模型的训练方法和使用方法所在的计算机系统200。其中,终端220用于训练检索模型和/或使用检索模型,服务器240用于训练检索模型和/或使用检索模型。
终端220安装和运行有支持检索对象的客户端。该客户端可以是支持检索对象的应用程序、网页和小程序中的任意一种。示例性的,被检索的对象包括但不限于图像、视频、音频、文本中的至少一种。终端220可以训练检索模型和/或使用检索模型。可选的,终端220上安装的客户端是操作系统平台(安卓或IOS)上的客户端。终端220可以泛指多个终端中的一个,本实施例仅以终端220举例说明。终端220的设备类型包括:智能手机、智能手表、智能电视、平板电脑、电子书阅读器、MP3播放器、MP4播放器、膝上型便携计算机和台式计算机中的至少一种。本领域技术人员可以知晓,上述终端的数量可以更多或更少。比如上述终端可以仅为一个,或者上述终端为几十个或几百个,或者更多数量。本申请实施例对终端的数量和设备类型不加以限定。
终端220通过无线网络或有线网络与服务器240相连。
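The server 240 includes at least one of a single server, multiple servers, a cloud computing platform, and a virtualization centre. Illustratively, the server 240 includes a processor 244 and a memory 242. The memory 242 in turn includes a receiving module 2421, a control module 2422, and a sending module 2423. The receiving module 2421 receives requests sent by the client, such as requests to train or to use the retrieval model; the control module 2422 controls the training and use of the retrieval model; the sending module 2423 sends responses to the terminal, such as returning the retrieved objects or the trained retrieval model to the client. The server 240 provides background services for object retrieval and/or retrieval model training. Optionally, the server 240 undertakes the primary computing work and the terminal 220 the secondary computing work; or the server 240 undertakes the secondary computing work and the terminal 220 the primary computing work; or the server 240 and the terminal 220 compute cooperatively using a distributed computing architecture.
The following description takes the case in which the training method and the retrieval method of the retrieval model are executed by a computer device as an example. Optionally, the computer device may be a terminal or a server; this application does not limit the entity that executes the training method or the retrieval method.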
为实现通过构建深度量化网络进行对象检索,图3是本申请一个示例性实施例的检索模型的训练方法的流程图。本实施例以该方法由图2所示的计算机设备来执行进行举例说明,其中,检索模型包括嵌入向量网络和量化索引网络,嵌入向量网络用于获取检索对象的特征向量,量化索引网络用于提取检索对象的量化索引,所述方法包括:
步骤320,获取用于训练检索模型的n个样本三元组;
其中,样本三元组包括训练样本、与训练样本构成相似样本对的正样本、以及与训练样本不构成相似样本对的负样本,n为大于1的正整数。
图像检索:通过查询图像的特征向量,得到量化索引,再从量化码本中找到与量化索引匹配的m个特征向量,计算m个特征向量与查询图像的特征向量的距离,选取与排名靠前的距离对应的图像作为最终筛选出的图像,m为正整数。
音频检索:通过查询音频的特征向量,得到量化索引,再从量化码本中找到与量化索引匹配的a个特征向量,计算a个特征向量与查询音频的特征向量的距离,选取与排名靠前的距离对应的音频作为最终筛选出的音频,a为正整数。
视频检索:通过查询视频的特征向量,得到量化索引,再从量化码本中找到与量化索引匹配的b个特征向量,计算b个特征向量与查询视频的特征向量的距离,选取与排名靠前的距离对应的视频作为最终筛选出的视频,b为正整数。
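Quantization index: an index, obtained by quantization, that is used to retrieve the retrieval object. In some embodiments, taking images as an example, one quantization index corresponds to one category of images. For instance, if all sample images fall into a person category and an animal category, one quantization index corresponds to the person category and another to the animal category. Illustratively, the sample images include dog images and cat images, where the quantization index of the dog images is (1, 0, 1) and that of the cat images is (1, 1, 1); if the quantization index of a query image is (1, 0, 1), the query image can be identified as a dog image from its quantization index.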
量化索引网络:指用于提取检索对象的量化索引的神经网络。
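Embedding vector network: a neural network used to obtain the feature vector of the retrieval object. Optionally, the embedding vector network performs a secondary encoding of the input basic feature vector to obtain the triplet feature vector. Illustratively, the input basic feature vector has size 1×2048; the secondary encoding reduces its size and yields a triplet feature vector of size 1×64.
Optionally, the retrieval object includes, but is not limited to, at least one of an image, a video, and an audio. The embodiments of this application are described taking an image as the retrieval object.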
需要说明的是,为了更好地训练量化索引网络和尽可能地让检索模型在量化索引和特征向量的距离计算上表现优异,往往采用样本三元组对检索模型进行训练。
步骤340,将n个样本三元组的基础特征向量输入嵌入向量网络;根据嵌入向量网络输出的特征向量的误差,筛选出用于训练量化索引网络的第一样本三元组集合;
基础特征向量:指样本三元组内的图像基础特征的向量,图像基础特征包括但不限于颜色特征、纹理特征、形状特征和空间关系特征。
其中,嵌入向量网络由n个样本三元组的基础特征向量生成n组三元组特征向量,基于n组三元组特征向量筛选出用于训练量化索引网络的第一样本三元组集合。
步骤360,将n个样本三元组的基础特征向量输入量化索引网络;根据量化索引网络输出的量化索引的误差,筛选出用于训练嵌入向量网络的第二样本三元组集合;
量化索引网络由n个样本三元组的基础特征向量生成n组三元组量化索引,筛选出用于训练嵌入向量网络的第二样本三元组集合。
步骤381,基于第一样本三元组集合训练量化索引网络;
步骤382,基于第二样本三元组集合训练嵌入向量网络。
基于第一样本三元组集合训练量化索引网络,以及基于第二样本三元组集合训练嵌入向量网络。
可选地,将第一样本三元组集合输入量化索引网络得到n 1个第三误差损失;基于n 1个第三误差损失,训练量化索引网络,n 1为小于n的正整数。
可选地,将第二样本三元组集合输入嵌入向量网络得到n 2个第四误差损失;基于n 2个第四误差损失,训练嵌入向量网络,n 2为小于n的正整数。
综上所述,通过样本三元组训练嵌入向量网络和量化索引网络,不仅将通过嵌入向量网络筛选得到的第一样本三元组集合作为训练量化索引网络的样本三元组,还将通过量化索引网络筛选得到的第二样本三元组集合作为训练嵌入向量网络的样本三元组,上述检索模型的训练方法剔除了样本三元组中噪声样本三元组的影响,同时使得量化索引网络与嵌入向量网络对正负样本的预测效果相似,通过双分支预测噪声样本三元组,并使得双分支彼此学习表现优异的样本三元组,实现去噪学习,使得双分支具有相似的预测效果。
图4是本申请一个示例性实施例提供的检索模型的训练系统,其中检索模型的训练系统包括嵌入向量网络401和量化索引网络402。
嵌入向量网络401用于由n个样本三元组的基础特征向量生成n组三元组特征向量,三元组特征向量可用于计算样本三元组内训练样本、正样本和负样本之间的相似度距离。
量化索引网络402用于由n个样本三元组的基础特征向量生成n组三元组量化索引,三元组量化索引用于量化样本三元组。
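In the training system shown in FIG. 4, sample triplets are constructed from samples, and the n constructed sample triplets are input into both the embedding vector network 401 and the quantization index network 402. The sample triplets are then screened: the first sample triplet set, screened by the embedding vector network 401, is used to compute the loss function of the quantization index network, and the quantization index network 402 is trained on that loss value; the second sample triplet set, screened by the quantization index network 402, is used to compute the loss function of the embedding vector network, and the embedding vector network 401 is trained on that loss value.
It is worth noting why the n sample triplets are screened through two branches: a noisy sample triplet (one whose negative sample actually forms a positive pair with the training sample) does not necessarily behave the same way in the embedding vector network 401 and the quantization index network 402. For example, the embedding vector network 401 may regard the negative sample and the training sample of a noisy triplet as a positive pair, which produces a high loss value, while the quantization index network 402 regards them as a negative pair, which produces a low loss value. Removing the noisy triplets on which the two branches disagree allows both the embedding vector network 401 and the quantization index network 402 to be trained better.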
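The two-branch screening can be sketched as a cross-selection over per-triplet losses. This is an illustrative sketch, not the application's implementation: the names `keep_smallest` and `cross_select` and the parameter `keep_fraction` are hypothetical, and the loss values below are made up for the example.

```python
def keep_smallest(losses, keep_fraction):
    """Indices of the triplets with the smallest losses, in ascending index order."""
    keep = int(len(losses) * keep_fraction)  # truncation is an assumption
    order = sorted(range(len(losses)), key=lambda i: losses[i])
    return sorted(order[:keep])


def cross_select(embedding_losses, quantization_losses, keep_fraction):
    """Each branch picks the triplets that will train the *other* branch."""
    # Screened by the embedding vector network, used to train the quantization index network.
    first_set = keep_smallest(embedding_losses, keep_fraction)
    # Screened by the quantization index network, used to train the embedding vector network.
    second_set = keep_smallest(quantization_losses, keep_fraction)
    return first_set, second_set


# Made-up losses for 4 triplets: triplet 1 looks noisy to the embedding branch,
# triplet 2 looks noisy to the quantization branch.
first, second = cross_select([0.1, 2.0, 0.3, 0.2], [0.2, 0.1, 1.5, 0.3], 0.75)
```

A triplet that one branch scores with a high loss is withheld from the other branch, which is how the disagreement described above is put to use.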
针对步骤320,获取用于训练检索模型的n个样本三元组。
在本申请的检索模型中,为训练嵌入向量网络和量化索引网络需引入样本三元组,样本三元组通过若干个相似样本对构建得到。
相似样本对:在本申请的检索模型中,为训练嵌入向量网络和量化索引网络需引入相似样本对。可选地,相似样本对中包含目标图像,与目标图像极度相似或相同的正样本,例如,视频中相邻的两帧图像可构建相似样本对。可选地,相似样本对中包含目标视频,与目标视频极度相似或相同的正样本。可选地,相似样本对中包含目标音频,与目标音频极度相似或相同的正样本,例如,相邻的两秒音频可构建相似样本对。
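Sample triplets: in the retrieval model of this application, n sample triplets are generated as follows. Among the R similar sample pairs of each training batch, for the training sample of each similar sample pair, one image is randomly selected from each of the remaining (R-1) similar sample pairs. The distances between the feature vectors of these (R-1) images and the feature vector of the training sample are computed and sorted in ascending order, and the images corresponding to the first n distances are each combined with the similar sample pair containing the training sample to form n sample triplets. In this way the n sample triplets used to train the retrieval model are obtained.
Illustratively, sample triplets are constructed as follows:
The 26 similar sample pairs are (A, A’), (B, B’), (C, C’), (D, D’), …, (Z, Z’). For the similar sample pair (A, A’), A is the training sample (anchor) and A’ is the positive sample (positive). One image is randomly selected from each of the remaining pairs (B, B’), (C, C’), (D, D’), …, (Z, Z’), for example B, C’, D, E’, …, Z. The distances between the feature vectors of B, C’, D, E’, …, Z and the feature vector of the training sample A are computed, giving 25 distances. These are sorted in ascending order, and the images corresponding to the first 20 distances are combined with the pair (A, A’) to form 20 sample triplets. In this way 20 sample triplets used to train the retrieval model are obtained.
In summary, the above method constructs n sample triplets from similar sample pairs, so that the resulting n sample triplets can be used to train the retrieval model provided in this application.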
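The construction procedure can be sketched in a few lines. This is a minimal illustration under stated assumptions: `build_triplets` is a hypothetical name, features are looked up from a plain dictionary rather than produced by a network, and the Euclidean distance is assumed as the distance measure.

```python
import math
import random


def build_triplets(pairs, features, anchor_index, n, rng):
    """Build n (anchor, positive, negative) triplets for one similar sample pair."""
    anchor, positive = pairs[anchor_index]
    scored = []
    for j, pair in enumerate(pairs):
        if j == anchor_index:
            continue
        candidate = rng.choice(pair)  # one random item from each remaining pair
        distance = math.dist(features[anchor], features[candidate])
        scored.append((distance, candidate))
    scored.sort(key=lambda item: item[0])  # ascending distance: closest candidates first
    return [(anchor, positive, negative) for _, negative in scored[:n]]


# Toy data (assumed): 4 similar pairs with 1-dimensional features.
toy_pairs = [("A", "A'"), ("B", "B'"), ("C", "C'"), ("D", "D'")]
toy_features = {"A": (0.0,), "A'": (0.1,), "B": (1.0,), "B'": (1.0,),
                "C": (2.0,), "C'": (2.0,), "D": (3.0,), "D'": (3.0,)}
toy_triplets = build_triplets(toy_pairs, toy_features, 0, 2, random.Random(0))
```

Keeping the closest candidates as negatives yields harder triplets, but it also raises the chance that a chosen negative is in fact similar to the anchor, which is the noise the later screening steps address.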
图5示出了本申请一个示例性实施例提供的检索模型的训练方法,以该方法应用于图2所示的计算机设备为例进行说明,其中步骤320、步骤360、步骤381和步骤382均与图3所示一致。
针对步骤340,将n个样本三元组的基础特征向量输入嵌入向量网络;根据嵌入向量网络输出的特征向量的误差,筛选出用于训练量化索引网络的第一样本三元组集合。
步骤340可以包括以下步骤:
步骤341,获取嵌入向量网络对n个样本三元组输出的n组三元组特征向量;
通过将n个样本三元组输入至嵌入向量网络401,嵌入向量网络401输出n组三元组特征向量。
步骤342,计算n组三元组特征向量对应的n个第一误差损失;
其中,计算n组三元组特征向量对应的n个第一误差损失包括:
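For each group of triplet feature vectors, a first distance between the feature vector of the training sample and the feature vector of the positive sample is computed, and a second distance between the feature vector of the training sample and the feature vector of the negative sample is computed. The first error loss is then computed from the difference between the first distance and the second distance together with a first distance threshold, where the first distance threshold is the threshold on the difference between the training-to-positive distance and the training-to-negative distance.
Illustratively, in each triplet (a, p, n), the triplet feature vectors are normalized by their L2 norm and the first error loss is then computed, where a (anchor) denotes the training sample, p (positive) the positive sample, and n (negative) the negative sample.
The first error loss is computed as follows:
$$l_{tri} = \max\left(\lVert X_a - X_p \rVert - \lVert X_a - X_n \rVert + \alpha,\ 0\right) \qquad (1)$$
where l_tri denotes the first error loss, X_a the feature vector of the training sample, X_p the feature vector of the positive sample, X_n the feature vector of the negative sample, and α the first distance threshold; ||X_a - X_p|| is the first distance and ||X_a - X_n|| is the second distance. Optionally, α is 4, meaning that the distance between the feature vectors of the training sample and the negative sample is expected to exceed the distance between the feature vectors of the training sample and the positive sample by more than 4. L2 normalization is used so that the feature space of the triplet samples lies in the range 0 to 1, which avoids a feature space so large that it hinders the optimization of the retrieval model.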
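Equation (1) can be written out directly. The sketch below assumes the Euclidean norm for ||·|| and uses hypothetical function names (`l2_normalize`, `triplet_loss`); it is an illustration, not the application's code.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit L2 norm (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return tuple(x / norm for x in vector) if norm > 0 else tuple(vector)


def triplet_loss(x_a, x_p, x_n, margin):
    """First error loss of equation (1): max(d(a, p) - d(a, n) + margin, 0)."""
    x_a, x_p, x_n = (l2_normalize(v) for v in (x_a, x_p, x_n))
    first_distance = math.dist(x_a, x_p)   # training sample to positive sample
    second_distance = math.dist(x_a, x_n)  # training sample to negative sample
    return max(first_distance - second_distance + margin, 0.0)


# Toy vectors (assumed): the positive coincides with the anchor after normalization.
toy_loss = triplet_loss((2.0, 0.0), (1.0, 0.0), (0.0, 1.0), 4.0)
```

One observation about this sketch: the Euclidean distance between two unit-length vectors never exceeds 2, so with the optional value α = 4 the hinge stays positive for every triplet; the code simply follows the values stated in the text.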
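Step 343: in the ascending ordering of the n first error losses, screen out the sample triplets corresponding to the n_1 first error losses ranked within a first selection range, and add them to the first sample triplet set used to train the quantization index network.
Step 342 yields n first error losses. These are sorted in ascending order, and the sample triplets corresponding to the n_1 first error losses that fall within the first selection range are added to the first sample triplet set used to train the quantization index network, where n_1 is a positive integer smaller than n.
Optionally, in the ascending ordering of the n first error losses, the sample triplets corresponding to the n_1 first error losses ranked in the top x% are selected according to a preset value x and added to the first sample triplet set used to train the quantization index network. Illustratively, x is set based on the proportion of noisy sample triplets among the n sample triplets; optionally, this proportion is obtained by prediction or calibration. Illustratively, the discarded share, (100 - x)%, is slightly larger than the proportion of noisy sample triplets among the n sample triplets. x is a positive number.
Illustratively, the sorted n first error losses form a list Lem_list, and the n_1 sample triplets corresponding to the first 85% of Lem_list are taken as the first sample triplet set.
It is worth noting that, when training the retrieval model, repeated experiments suggest that about 10% of the n sample triplets are noisy (that is, for a given similar sample pair, about 10% of the images from other similar sample pairs may in fact form a positive pair with the training sample), so the n_1 sample triplets corresponding to the first 85% of Lem_list are taken as the first sample triplet set. Within any given training epoch, however, the noisy triplets cannot be guaranteed to fall in the last 10%, especially in the first few epochs, so keeping 5% fewer helps to avoid selecting noisy triplets.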
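The top-x% selection can be sketched as follows. The name `select_small_loss` is hypothetical, and truncating n·x/100 to an integer is an assumption; the text does not say how a fractional count is rounded.

```python
def select_small_loss(losses, x_percent):
    """Indices of the triplets whose losses rank in the smallest x percent."""
    keep = int(len(losses) * x_percent / 100)  # rounding rule is an assumption
    order = sorted(range(len(losses)), key=lambda i: losses[i])
    return order[:keep]


# With the 20 triplets of the earlier example and x = 85, 17 triplets are kept.
toy_losses = [float(19 - i) for i in range(20)]
toy_selected = select_small_loss(toy_losses, 85)
```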
综上所述,上述方法通过设计嵌入向量网络的损失函数,实现了计算n个样本三元组的n组三元组特征向量的损失函数值,并基于损失函数值筛选用于训练量化索引网络的第一样本三元组集合,进一步采用更干净的样本三元组优化量化索引网络的量化效果。
图6示出了本申请一个示例性实施例提供的检索模型的训练方法,以该方法应用于图2所示的计算机设备为例进行说明,其中步骤320、步骤340、步骤381和步骤382均与图3所示一致。
针对步骤360,将n个样本三元组的基础特征向量输入量化索引网络;根据量化索引网络输出的量化索引的误差,筛选出用于训练嵌入向量网络的第二样本三元组集合;
步骤360可以包括以下步骤:
步骤361,获取量化索引网络对n个样本三元组输出的n组三元组量化索引;
通过将n个样本三元组的基础特征向量输入量化索引网络,输出n组三元组量化索引。
步骤362,计算n组三元组量化索引对应的n个第二误差损失;
其中,计算n组三元组量化索引对应的n个第二误差损失的方法包括:
针对每组三元组量化索引,计算三元组量化索引的第一三元组损失;针对每组三元组量化索引,计算三元组量化索引的第一量化误差损失;对第一三元组损失和第一量化误差损失进行加权求和,得到第二误差损失。
计算第一三元组损失,示意性的,将三元组量化索引通过激活函数激活,再计算量化索引的第一三元组损失,为保证样本三元组在量化空间可区分,故样本三元组的训练样本与负样本的距离需要足够大,可选的,设置margin(公式(1)中的α)为160。
第一三元组损失的计算方式与上述第一误差损失的计算方式类似,不再赘述,此处的区别在于三元组量化索引的维度与三元组特征向量的维度不一致。
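The quantization index network 402 outputs n groups of triplet quantization indexes. Optionally, each triplet quantization index is 256-dimensional; an activation function maps every dimension to a value in (-1, 1), and sign quantization is then applied to obtain the final 256-dimensional triplet quantization index.
Computing the first quantization error loss: to train the quantization index network better, every dimension of the triplet quantization index should be as close as possible to -1 or 1 (if a value lies near the critical value 0, similar basic feature vectors are easily quantized into different segments). The quantization loss is therefore defined by equations (2) and (3) below. First compute: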
Figure PCTCN2022107973-appb-000001
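where u_i is the value of each dimension of the triplet quantization index before sign quantization (in sign quantization, values less than 0 become 0 and values greater than 0 become 1), with u_i in the range (-1, 1), and b_i is the value of each dimension of the triplet quantization index after conversion (-1 or 1);
The regression loss between b_i and u_i is then computed: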
Figure PCTCN2022107973-appb-000002
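where L_coding is the first quantization error loss.
The activation function above may be the tanh function or the sgn function. Note that the sgn function is not differentiable at 0 (+0 and -0), so no gradient can be computed there, and it therefore cannot be used in deep learning based on SGD gradient back-propagation. tanh, by contrast, is differentiable and maps its input into the range -1 to 1. The sigmoid function may also be used as the activation (mapping into 0 to 1), in which case 0 and 1 rather than -1 and 1 serve as the quantization targets. tanh gives better training results because it reaches (-1, 1) more quickly.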
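Equations (2) and (3) survive only as figure placeholders in this text, so the following is a hedged sketch rather than a reconstruction. It assumes that b_i is the sign of u_i taken as -1 or 1, that an exact 0 maps to -1, and that the regression loss is a mean squared error; all three are assumptions, and the function names are hypothetical.

```python
import math


def activate(raw_outputs):
    """tanh activation: maps every dimension into (-1, 1)."""
    return [math.tanh(r) for r in raw_outputs]


def binarize(u):
    """Target b_i in {-1, 1}; mapping an exact 0 to -1 is an assumption."""
    return [1.0 if v > 0 else -1.0 for v in u]


def coding_loss(u):
    """Assumed form of the regression loss: mean squared error between u and b."""
    b = binarize(u)
    return sum((ui - bi) ** 2 for ui, bi in zip(u, b)) / len(u)


# Outputs far from 0 sit close to -1 or 1 and are penalized less.
confident = coding_loss(activate([3.0, -3.0]))
uncertain = coding_loss(activate([0.1, -0.1]))
```

Whatever the exact form of equation (3), the intent stated in the text is the same: values near 0 are penalized so that each dimension is pushed towards -1 or 1 before sign quantization.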
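Computing the second error loss: the second error loss is obtained as a weighted sum of the first triplet loss and the first quantization error loss computed above.
$$L_q = w_{21} L_{triplet} + w_{22} L_{coding} \qquad (4)$$
where L_q is the second error loss, L_triplet is the first triplet loss, L_coding is the first quantization error loss, and w_21 and w_22 are weights. Optionally, because the first quantization error loss converges faster than the first triplet loss, w_21 is set to 1 and w_22 to 0.5 so that the first triplet loss dominates the overall second error loss, which ensures that the embedding vector network always retains its ability to measure similarity.
It is worth noting that the values of w_21 and w_22 are not fixed; it is only necessary that w_22 be smaller than w_21.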
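Equation (4) and the constraint on the weights can be captured in a small helper. The name `second_error_loss` is hypothetical, and raising an error when w_22 is not smaller than w_21 is an illustrative design choice, not something the text prescribes.

```python
def second_error_loss(l_triplet, l_coding, w21=1.0, w22=0.5):
    """Weighted sum of equation (4): L_q = w21 * L_triplet + w22 * L_coding."""
    if not w22 < w21:
        # The text only requires w22 < w21; enforcing it here is an assumption.
        raise ValueError("w22 is expected to be smaller than w21")
    return w21 * l_triplet + w22 * l_coding


# With the optional weights 1 and 0.5 the triplet term dominates.
toy_l_q = second_error_loss(2.0, 0.4)
```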
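Step 363: in the ascending ordering of the n second error losses, screen out the sample triplets corresponding to the n_2 second error losses ranked within a second selection range, and add them to the second sample triplet set used to train the embedding vector network.
Step 362 yields n second error losses. These are sorted in ascending order, and the sample triplets corresponding to the n_2 second error losses that fall within the second selection range are added to the second sample triplet set used to train the embedding vector network.
Optionally, in the ascending ordering of the n second error losses, the sample triplets corresponding to the n_2 second error losses ranked in the top y% are selected according to a preset value y and added to the second sample triplet set used to train the embedding vector network. Illustratively, y is set based on the proportion of noisy sample triplets among the n sample triplets; optionally, this proportion is obtained by prediction or calibration. Illustratively, the discarded share, (100 - y)%, is slightly larger than the proportion of noisy sample triplets among the n sample triplets. y is a positive number.
Illustratively, the sorted n second error losses form a list Lq_list, and the n_2 sample triplets corresponding to the first 85% of Lq_list are taken as the second sample triplet set, where n_2 is a positive integer smaller than n.
It is worth noting that, when training the retrieval model, repeated experiments suggest that about 10% of the n sample triplets are noisy (that is, for a given similar sample pair, about 10% of the images from other similar sample pairs may in fact form a positive pair with the training sample), so the n_2 sample triplets corresponding to the first 85% of Lq_list are taken as the second sample triplet set. Within any given training epoch, however, the noisy triplets cannot be guaranteed to fall in the last 10%, especially in the first few epochs, so keeping 5% fewer helps to avoid selecting noisy triplets.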
综上所述,上述方法通过设计量化索引网络的损失函数,实现了计算n个样本三元组的n组三元组量化索引的损失函数值,并基于损失函数值筛选用于训练嵌入向量网络的第二样本三元组集合,进一步采用更干净的样本三元组优化嵌入向量网络的特征向量的计算效果。
图7示出了本申请一个示例性实施例提供的检索模型的训练方法,以该方法应用于图2所示的计算机设备为例进行说明,其中步骤320、步骤340和步骤360均与图3所示一致。
针对步骤381,基于第一样本三元组集合训练量化索引网络,包括:
步骤381-1,将第一样本三元组集合输入量化索引网络得到n 1个第三误差损失;
其中,第一样本三元组集合包括n 1个样本三元组,第一样本三元组是基于嵌入向量网络401输出的n组三元组特征向量的第一误差损失排序得到的,具体参考上述“针对步骤340”。
其中,将第一样本三元组集合输入量化索引网络得到n 1个第三误差损失包括:
针对第一样本三元组集合的每个样本三元组,通过量化索引网络计算三元组特征向量的第二三元组损失,其中,三元组特征向量是嵌入向量网络输出的特征向量;针对第一样本三元组集合的每个样本三元组,通过量化索引网络计算三元组特征向量的第二量化误差损失;对第二三元组损失和第二量化误差损失进行加权求和,得到第三误差损失。
计算n 1个第三误差损失的方法与上述步骤362中计算n个第二误差损失的方式相类似,不再赘述。
步骤381-2,基于n 1个第三误差损失,训练量化索引网络。
计算机设备基于n 1个第三误差损失,训练量化索引网络。
针对步骤382,基于第二样本三元组集合训练嵌入向量网络,包括:
步骤382-1,将第二样本三元组集合输入嵌入向量网络得到n 2个第四误差损失;
其中,第二样本三元组集合包括n 2个样本三元组,第二样本三元组是基于量化索引网络402输出的n组三元组量化索引的第二误差损失排序得到的,具体参考上述“针对步骤360”。
其中,将第二样本三元组集合输入嵌入向量网络得到n 2个第四误差损失包括:
针对第二样本三元组集合的每个样本三元组,通过嵌入向量网络计算训练样本的特征向量和正样本的特征向量之间的第三距离;
针对第二样本三元组集合的每个样本三元组,通过嵌入向量网络计算训练样本的特征向量和负样本的特征向量之间的第四距离;
计算第三距离和第四距离之间的差值与第二距离阈值之间的第四误差损失,第二距离阈值是训练样本与正样本之间的距离和训练样本与负样本的距离的差值的阈值。
计算n 2个第四误差损失的方法与上述步骤342中计算n个第一误差损失的方式相类似,不再赘述。
步骤382-2,基于n 2个第四误差损失,训练嵌入向量网络。
计算机设备基于n 2个第四误差损失,训练嵌入向量网络。
综上所述,采用第一样本三元组集合训练量化索引网络,以及采用第二样本三元组集合训练嵌入向量网络,优化了量化索引网络的量化效果和嵌入向量网络的特征向量计算效果。
图8示出了本申请一个示例性实施例提供的检索模型的训练系统,其比图4所示的检索模型的训练系统多增加了基础特征网络403。
基于图3所示的检索模型的训练方法,在步骤320和步骤340之间还包括:通过基础特征网络,获取n个样本三元组的基础特征向量。
基础特征网络403用于提取输入的样本三元组的图像基础特征,基础特征包括但不限于颜色特征、纹理特征、形状特征和空间关系特征。
图9示出了本申请一个示例性实施例提供的检索模型的训练方法的示意图,以该方法应用于图2所示的计算机设备进行举例说明,该方法包括:
步骤320,获取用于训练检索模型的n个样本三元组;
计算机设备获取用于训练检索模型的n个样本三元组。
关于步骤320的详细介绍,可参考上述“针对步骤320”。
步骤330,通过基础特征网络,获取n个样本三元组的基础特征向量;
通过基础特征网络,计算机设备获取n个样本三元组的基础特征向量。
步骤341,获取嵌入向量网络对n个样本三元组输出的n组三元组特征向量;
计算机设备获取嵌入向量网络对n个样本三元组输出的n组三元组特征向量。
步骤342,计算n组三元组特征向量对应的n个第一误差损失;
计算机设备计算n组三元组特征向量对应的n个第一误差损失。
步骤343,在n个第一误差损失由小到大的排序结果中,筛选出排序在第一选取范围内的n 1个第一误差损失所对应的样本三元组,添加至用于训练量化索引网络的第一样本三元组集合。
在n个第一误差损失由小到大的排序结果中,计算机设备筛选出排序在第一选取范围内的n 1个第一误差损失所对应的样本三元组,添加至用于训练量化索引网络的第一样本三元组集合。
关于步骤341、步骤342和步骤343的详细介绍,可参考图5所示的实施例。
步骤381-1,将第一样本三元组集合输入量化索引网络得到n 1个第三误差损失;
计算机设备将第一样本三元组集合输入量化索引网络得到n 1个第三误差损失。其中,将第一样本三元组集合输入量化索引网络得到n 1个第三误差损失包括:
针对第一样本三元组集合的每个样本三元组,计算机设备计算三元组特征向量的第二三元组损失;针对第一样本三元组集合的每个样本三元组,计算机设备计算三元组特征向量的第二量化误差损失;计算机设备对第二三元组损失和第二量化误差损失进行加权求和,得到第三误差损失。
步骤381-2,基于n 1个第三误差损失,训练量化索引网络;
计算机设备基于n 1个第三误差损失,训练量化索引网络。
步骤361,获取量化索引网络对n个样本三元组输出的n组三元组量化索引;
计算机设备通过将n个样本三元组的基础特征向量输入量化索引网络,输出n组三元组量化索引。
步骤362,计算n组三元组量化索引对应的n个第二误差损失;
计算机设备计算n组三元组量化索引对应的n个第二误差损失的方法包括:
针对每组三元组量化索引,计算机设备计算三元组量化索引的第一三元组损失;针对每组三元组量化索引,计算机设备计算三元组量化索引的第一量化误差损失;计算机设备对第一三元组损失和第一量化误差损失进行加权求和,得到第二误差损失。
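Step 363: in the ascending ordering of the n second error losses, screen out the sample triplets corresponding to the n_2 second error losses ranked within a second selection range, and add them to the second sample triplet set used to train the embedding vector network;
In the ascending ordering of the n second error losses, the computer device screens out the sample triplets corresponding to the n_2 second error losses ranked in the top y% and adds them to the second sample triplet set used to train the embedding vector network.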
关于步骤361、步骤362和步骤363的详细介绍,可参考图6所示的实施例。
步骤382-1,将第二样本三元组集合输入嵌入向量网络得到n 2个第四误差损失;
计算机设备将第二样本三元组集合输入嵌入向量网络得到n 2个第四误差损失。
步骤382-2,基于n 2个第四误差损失,训练嵌入向量网络。
计算机设备基于n 2个第四误差损失,训练嵌入向量网络。
关于步骤381-1、步骤381-2、步骤382-1和步骤382-2的详细介绍,可参考图7所示的实施例。
综上所述,通过样本三元组训练嵌入向量网络和量化索引网络,不仅将通过嵌入向量网络筛选得到的第一样本三元组集合作为训练量化索引网络的样本三元组,还将通过量化索引网络筛选得到的第二样本三元组集合作为训练嵌入向量网络的样本三元组,上述方法使得嵌入向量网络支持计算m个候选图像的特征向量与查询图像的特征向量之间的距离,还使得量化索引网络得到的查询图像的量化索引更加准确。上述检索模型的训练方法剔除了噪声样本三元组,同时使得量化索引网络与嵌入向量网络对正负样本的预测效果相似,通过双分支预测噪声样本三元组,并使得双分支彼此学习表现优异的样本三元组,实现去噪学习,使得双分支具有相似的预测效果。
基于上述,已完整论述了本申请的技术方案,接下来介绍检索模型的相关训练参数。
针对基础特征网络,可选的,基础特征网络是resnet101网络(一种卷积神经网络),即,基础特征网络可采用resnet101网络训练,具体参数详见下表1,其中基础特征网络的Conv1-Conv5采用在ImageNet(大型通用物体识别开源数据集)数据集上预训练的ResNet101的参数,可选的,基础特征网络还可以采用resnet18CNN网络。
表1
Figure PCTCN2022107973-appb-000003
针对嵌入向量网络,采用如下表2的参数训练,嵌入向量网络也可称为Embedding网络,嵌入向量网络采用方差为0.01,均值为0的高斯分布进行初始化,可选的,嵌入向量网络还可以采用多层Fc连接,嵌入向量网络输出64维向量。
表2
Figure PCTCN2022107973-appb-000004
针对量化索引网络,采用如下表3所示的参数训练,量化索引网络采用方差为0.01,均值为0的高斯分布进行初始化。
表3
Figure PCTCN2022107973-appb-000005
针对检索模型:
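Learning parameters: when the retrieval model is updated, the first stage trains the embedding vector network and needs to update the underlying basic features, with the learning parameters set as shown in Table 1 and Table 2. The second stage trains the quantization index network and does not update the basic feature network or the embedding vector network.
Learning rate: the basic feature network, the embedding vector network, and the quantization index network all use a learning rate lr1 = 0.005, and after every 10 epochs lr is multiplied by 0.1.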
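The step decay of the learning rate can be expressed as a one-line schedule. The function name `learning_rate` and the convention that epochs are counted from 0 are assumptions made for this sketch.

```python
def learning_rate(epoch, base_lr=0.005, step=10, gamma=0.1):
    """Step decay: multiply the rate by gamma after every `step` epochs.

    Assumption: epochs are counted from 0, so epochs 0-9 use base_lr.
    """
    return base_lr * gamma ** (epoch // step)


# Rates for the first three decay stages.
toy_schedule = [learning_rate(e) for e in (0, 10, 20)]
```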
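Learning process: the full set of samples is iterated over for a number of epochs, each epoch processing the full set in batches, until the average epoch loss (the total loss of the retrieval model, obtained as a weighted sum of the third error loss and the fourth error loss above) no longer decreases.
Operations in each epoch: all similar sample pairs are divided into Nb batches according to the batch size, and for each batch the n sample triplets described above are obtained.
Model parameter update: stochastic gradient descent (SGD) is used. The loss value obtained on the previous batch is back-propagated to compute the parameter updates of the embedding vector network and the quantization index network, and the networks are updated.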
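To implement object retrieval with the constructed deep quantization network, object retrieval optionally includes the following steps:
1. Obtain the basic feature vector of the query object.
2. Input the basic feature vector into the quantization index network and the embedding vector network.
3. Obtain the quantization index of the query object through the quantization index network, and obtain the feature vector of the query object through the embedding vector network.
4. Based on the quantization index, look up the feature vectors of m candidate objects in the quantization codebook; the quantization codebook stores the mapping between quantization indexes and the feature vectors of the m candidate objects, and m is a positive integer.
5. Compute the fifth distance between the feature vector of each of the m candidate objects and the feature vector of the query object, giving m fifth distances.
6. In the ascending ordering of the m fifth distances, screen out, according to a preset value z, the candidate objects corresponding to the fifth distances ranked in the top z%, where z is a positive number.
Next, the case in which the query object is a query image is described in detail. FIG. 10 is a flowchart of an image retrieval method provided by an exemplary embodiment of this application. This embodiment is described taking the method being executed by the computer device shown in FIG. 2 as an example. The method includes: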
步骤1001,获取查询图像的基础特征向量;
查询图像是用于图像检索的图像。
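In one embodiment, the retrieval model further includes a basic feature network, through which the basic feature vector of the query image is obtained. Illustratively, FIG. 11 is a schematic diagram of a system for using the retrieval model provided by an exemplary embodiment of this application, in which the query image 1101 is input into the basic feature network 403 to obtain the basic feature vector of the query image 1101.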
步骤1002,将基础特征向量输入至量化索引网络和嵌入向量网络;
结合参考图11,将查询图像1101的基础特征向量输入至量化索引网络402和嵌入向量网络401。
步骤1003,通过量化索引网络获取查询图像的量化索引,以及通过嵌入向量网络获取查询图像的特征向量;
结合参考图11,通过量化索引网络402获取查询图像1101的量化索引(1,0,0),以及,通过嵌入向量网络401获取查询图像1101的特征向量(0.2,0.8,0.3,0.3)。
步骤1004,基于量化索引,从量化码本中索引得到m个候选图像的特征向量;
其中,量化码本存储有量化索引与m个候选图像的特征向量之间的映射关系,m为正整数。
结合参考图11,基于量化索引(1,0,0),从量化码本中索引得到3个候选图像的特征向量:(0.2,0.7,0.3,0.3)、(0.1,0.5,0.2,0.2)和(0.2,0.4,0.2,0.3)。
图11还示出了本申请一个示例性实施例提供的量化码本的构建过程,将图像库中的所有图像进行特征提取,得到量化索引和特征向量。
在一个实施例中,量化码本是由以下步骤构建的:
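First, any image i in the image library is input into the retrieval model: the embedding vector network 401 produces its feature vector e, and the quantization index network 402 produces its quantization index q (q is a vector, obtained through the sign function, whose every dimension is 0 or 1). The mapping table T[i:e] from images to the outputs of the embedding vector network 401 is recorded (where i is the image number and e is the feature vector output by the embedding vector network for image i).
Second, the numbers of the images sharing the same q are recorded in the mapping table Linvert[i:q] for q, for example {q1: [image 1, image 2, image 5], q2: [image 3], q3: [image 4]}, and the list of all quantization indexes is saved as Lindex: [q1, q2, q3].
Third, for an image i’ newly added to the image library, its qi’ and ei’ can be computed. When qi’ already exists in the list Lindex, i’ is added directly to the entry of the mapping table Linvert that corresponds to qi’, and the image number i’ together with ei’ is added to the mapping table T (a new number-to-feature record, such as i’:ei’).
The list Lindex of quantization indexes, the mapping table Linvert for q, and the mapping table T together make up a quantization codebook.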
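The three structures T, Linvert, and Lindex can be sketched with plain dictionaries and a list. This is an illustration only: `add_item` and `build_codebook` are hypothetical names, the quantization indexes and feature vectors are made-up values, and creating a new entry when a quantization index is not yet in Lindex is an assumption, since the text describes only the case in which it already exists.

```python
def add_item(T, Linvert, Lindex, item_id, q, e):
    """Register one item: its feature in T and its number under q in Linvert."""
    q = tuple(q)
    if q not in Linvert:
        # Assumption: an unseen quantization index gets a new entry.
        Linvert[q] = []
        Lindex.append(q)
    Linvert[q].append(item_id)
    T[item_id] = tuple(e)


def build_codebook(items):
    """items: iterable of (item_id, quantization_index, feature_vector)."""
    T, Linvert, Lindex = {}, {}, []
    for item_id, q, e in items:
        add_item(T, Linvert, Lindex, item_id, q, e)
    return T, Linvert, Lindex


# Made-up library matching the grouping in the text: q1 -> images 1, 2, 5.
q1, q2, q3 = (1, 0, 0), (0, 1, 0), (1, 1, 1)
toy_items = [(1, q1, (0.2, 0.7)), (2, q1, (0.1, 0.5)), (3, q2, (0.9, 0.1)),
             (4, q3, (0.4, 0.4)), (5, q1, (0.2, 0.4))]
T, Linvert, Lindex = build_codebook(toy_items)
```

Storing the item numbers per quantization index is what makes the structure an inverted index: a query only has to inspect the entries whose index is close to its own.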
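In one embodiment, obtaining the feature vectors of the m candidate images from the quantization codebook based on the quantization index includes: computing the quantization index of the query image; determining, from the list Lindex, the quantization indexes whose Hamming distance to the quantization index of the query image is smaller than a threshold; determining, from the mapping table Linvert, the m candidate images corresponding to those quantization indexes; and determining the feature vectors of the m candidate images from the mapping table T.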
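The Hamming-distance lookup can be sketched as follows. The function names `hamming` and `find_candidates` are hypothetical, and the small codebook below is made up for the example; it is not data from the application.

```python
def hamming(a, b):
    """Number of positions at which two equal-length binary codes differ."""
    return sum(x != y for x, y in zip(a, b))


def find_candidates(query_q, T, Linvert, Lindex, threshold):
    """Feature vectors of all items whose quantization index is within the threshold."""
    result = {}
    for q in Lindex:
        if hamming(query_q, q) < threshold:  # strictly smaller, as in the text
            for item_id in Linvert[q]:
                result[item_id] = T[item_id]
    return result


# Made-up codebook with three quantization indexes and four items.
toy_Lindex = [(1, 0, 0), (1, 0, 1), (0, 1, 1)]
toy_Linvert = {(1, 0, 0): [1, 2], (1, 0, 1): [3], (0, 1, 1): [4]}
toy_T = {1: (0.2, 0.7, 0.3, 0.3), 2: (0.1, 0.5, 0.2, 0.2),
         3: (0.2, 0.4, 0.2, 0.3), 4: (0.9, 0.1, 0.5, 0.5)}
toy_candidates = find_candidates((1, 0, 0), toy_T, toy_Linvert, toy_Lindex, 2)
```

Allowing a small Hamming radius rather than an exact match lets items whose index differs in a single bit still be recalled.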
步骤1005,分别计算m个候选图像的特征向量与查询图像的特征向量的第五距离,得到m个第五距离;
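With reference to FIG. 11, the fifth distances between the feature vector (0.2, 0.8, 0.3, 0.3) of the query image 1101 and the feature vectors of the 3 candidate images, (0.2, 0.7, 0.3, 0.3), (0.1, 0.5, 0.2, 0.2), and (0.2, 0.4, 0.2, 0.3), are computed, giving 3 fifth distances. The fifth distance is the Euclidean distance.
Step 1006: in the ascending ordering of the m fifth distances, screen out, according to a preset value z, the candidate images corresponding to the fifth distances ranked in the top z%.
With reference to FIG. 11, in the ascending ordering of the 3 fifth distances, the candidate images corresponding to the fifth distances ranked in the top z% are screened out according to the preset value z. The value z% may be obtained by configuring the retrieval model in advance; optionally, z may be set according to the retrieval requirements so that the selected candidate images meet the retrieval expectations of the retrieval model. z is a positive number.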
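The ranking step can be illustrated with the feature vectors of the FIG. 11 example. The function name `rank_candidates` is hypothetical, and rounding the number of kept candidates up (with at least one kept) is an assumption; the text does not define the rounding.

```python
import math


def rank_candidates(query, candidate_features, z_percent):
    """Sort candidates by Euclidean distance to the query and keep the top z percent."""
    ranked = sorted(candidate_features.items(),
                    key=lambda kv: math.dist(query, kv[1]))
    # Assumption: round up and always keep at least one candidate.
    keep = max(1, math.ceil(len(ranked) * z_percent / 100))
    return [(cid, math.dist(query, feat)) for cid, feat in ranked[:keep]]


query_feature = (0.2, 0.8, 0.3, 0.3)
candidates = {1: (0.2, 0.7, 0.3, 0.3),
              2: (0.1, 0.5, 0.2, 0.2),
              3: (0.2, 0.4, 0.2, 0.3)}
top_half = rank_candidates(query_feature, candidates, 50)
```

For these vectors the three Euclidean distances are approximately 0.10, 0.35, and 0.41, so the candidates are already listed in ascending order of distance.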
在一个实施例中,计算机设备还将筛选得到的候选图像发送至运行有图像检索功能的客户端。
综上所述,上述方法通过包含有基础特征网络、嵌入向量网络和量化索引网络的检索模型,可进行查询图像与m个候选图像的距离排序并筛选出排序靠前的图像,上述方法不仅实现了筛选得到的图像更接近查询图像,还避免了在确定m个候选图像时丢失或额外增加候选图像。
图12是本申请一个示例性实施例提供的音频检索方法的流程图。本实施例以该方法由图2所示的计算机设备来执行进行举例说明,所述方法包括:
步骤1201,获取查询音频的基础特征向量;
查询音频是用于音频检索的音频。
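In one embodiment, the retrieval model further includes a basic feature network, through which the basic feature vector of the query audio is obtained.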
步骤1202,将基础特征向量输入至量化索引网络和嵌入向量网络;
示例性的,将查询音频的基础特征向量输入至量化索引网络和嵌入向量网络。
步骤1203,通过量化索引网络获取查询音频的量化索引,以及通过嵌入向量网络获取查询音频的特征向量;
示例性的,通过量化索引网络获取查询音频的量化索引(1,0,0),以及,通过嵌入向量网络获取查询音频的特征向量(0.2,0.8,0.3,0.3)。
步骤1204,基于量化索引,从量化码本中索引得到m个候选音频的特征向量;
其中,量化码本存储有量化索引与m个候选音频的特征向量之间的映射关系,m为正整数。
示例性的,基于量化索引(1,0,0),从量化码本中索引得到3个候选音频的特征向量:(0.2,0.7,0.3,0.3)、(0.1,0.5,0.2,0.2)和(0.2,0.4,0.2,0.3)。
在一个实施例中,量化码本是由以下步骤构建的:
第一、对音频库中任意一个音频i输入上述检索模型,其中可由嵌入向量网络401得到特征向量e,由量化索引网络402得到量化索引q(q是经符号函数得到的每维度为0或1的向量),纪录音频i嵌入向量网络401的映射表T[i:e](其中i表示音频的序号,e表示音频i经嵌入向量网络输出的特征向量)。
第二、将具有相同q的音频序号记录到q的映射表Linvert[i:q],如,{q1:[音频1,音频2,音频5],q2:[音频3],q3:[音频4]},保存所有量化索引的列表Lindex:[q1,q2,q3]。
第三、对于新加入音频库的音频i’,可以计算其qi’和ei’,当qi’存在于列表Lindex中时,直接把i’加入到Lindex下qi’对应的映射表Linvert中,把音频序号i’和ei’加入到T映射表 (新增一个序号与特征的记录,如i’:ei’)。
上述量化索引的列表Lindex、q的映射表Linvert和T列表共同组成了一个量化码本。
在一个实施例中,基于量化索引,从量化码本中索引得到m个候选音频的特征向量包括:计算得到查询音频的量化索引,从列表Lindex中确定与查询音频的量化索引的汉明距离小于阈值的若干个量化索引,并从映射表Linvert中确定与若干个量化索引对应的m个候选音频,通过T映射表中确定m个候选音频的特征向量。
步骤1205,分别计算m个候选音频的特征向量与查询音频的特征向量的第五距离,得到m个第五距离;
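Illustratively, the fifth distances between the feature vector (0.2, 0.8, 0.3, 0.3) of the query audio and the feature vectors of the 3 candidate audios, (0.2, 0.7, 0.3, 0.3), (0.1, 0.5, 0.2, 0.2), and (0.2, 0.4, 0.2, 0.3), are computed, giving 3 fifth distances. The fifth distance is the Euclidean distance.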
步骤1206,在m个第五距离由小到大的排序结果中,根据预设的z值,筛选出排序在前z%的第五距离对应的候选音频。
示例性的,在3个第五距离由小到大的排序结果中,根据预设的z值,筛选出排序在前z%的第五距离对应的候选音频,其中,z%可由预先配置检索模型得到,可选的,根据检索需求可合理设置z值,使得筛选得到的候选音频满足检索模型的检索预期。z为正数。
在一个实施例中,计算机设备还将筛选得到的候选音频发送至运行有音频检索功能的客户端。
综上所述,上述方法通过包含有基础特征网络、嵌入向量网络和量化索引网络的检索模型,可进行查询音频与m个候选音频的距离排序并筛选出排序靠前的音频,上述方法不仅实现了筛选得到的音频更接近查询音频,还避免了在确定m个候选音频时丢失或额外增加候选音频。
图13是本申请一个示例性实施例提供的视频检索方法的流程图。本实施例以该方法由图2所示的计算机设备来执行进行举例说明,所述方法包括:
步骤1301,获取查询视频的基础特征向量;
查询视频是用于视频检索的视频。
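In one embodiment, the retrieval model further includes a basic feature network, through which the basic feature vector of the query video is obtained.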
步骤1302,将基础特征向量输入至量化索引网络和嵌入向量网络;
示例性的,将查询视频的基础特征向量输入至量化索引网络和嵌入向量网络。
步骤1303,通过量化索引网络获取查询视频的量化索引,以及通过嵌入向量网络获取查询视频的特征向量;
示例性的,通过量化索引网络获取查询视频的量化索引(1,0,0),以及,通过嵌入向量网络获取查询视频的特征向量(0.2,0.8,0.3,0.3)。
步骤1304,基于量化索引,从量化码本中索引得到m个候选视频的特征向量;
其中,量化码本存储有量化索引与m个候选视频的特征向量之间的映射关系,m为正整数。
示例性的,基于量化索引(1,0,0),从量化码本中索引得到3个候选视频的特征向量:(0.2,0.7,0.3,0.3)、(0.1,0.5,0.2,0.2)和(0.2,0.4,0.2,0.3)。
在一个实施例中,量化码本是由以下步骤构建的:
第一、对视频库中任意一个视频i输入上述检索模型,其中可由嵌入向量网络401得到特征向量e,由量化索引网络402得到量化索引q(q是经符号函数得到的每维度为0或1的向量),纪录视频i嵌入向量网络401的映射表T[i:e](其中i表示视频的序号,e表示视频i经嵌入向量网络输出的特征向量)。
第二、将具有相同q的视频序号记录到q的映射表Linvert[i:q],如,{q1:[视频1,视频2,视频5],q2:[视频3],q3:[视频4]},保存所有量化索引的列表Lindex:[q1,q2,q3]。
第三、对于新加入视频库的视频i’,可以计算其qi’和ei’,当qi’存在于列表Lindex中时,直接把i’加入到Lindex下qi’对应的映射表Linvert中,把视频序号i’和ei’加入到T映射表(新增一个序号与特征的记录,如i’:ei’)。
上述量化索引的列表Lindex、q的映射表Linvert和T列表共同组成了一个量化码本。
在一个实施例中,基于量化索引,从量化码本中索引得到m个候选视频的特征向量包括:计算得到查询视频的量化索引,从列表Lindex中确定与查询视频的量化索引的汉明距离小于阈值的若干个量化索引,并从映射表Linvert中确定与若干个量化索引对应的m个候选视频,通过T映射表中确定m个候选视频的特征向量。
步骤1305,分别计算m个候选视频的特征向量与查询视频的特征向量的第五距离,得到m个第五距离;
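Illustratively, the fifth distances between the feature vector (0.2, 0.8, 0.3, 0.3) of the query video and the feature vectors of the 3 candidate videos, (0.2, 0.7, 0.3, 0.3), (0.1, 0.5, 0.2, 0.2), and (0.2, 0.4, 0.2, 0.3), are computed, giving 3 fifth distances. The fifth distance is the Euclidean distance.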
步骤1306,在m个第五距离由小到大的排序结果中,根据预设的z值,筛选出排序在前z%的第五距离对应的候选视频。
示例性的,在3个第五距离由小到大的排序结果中,根据预设的z值,筛选出排序在前z%的第五距离对应的候选视频,其中,z%可由预先配置检索模型得到,可选的,根据检索需求可合理设置z值,使得筛选得到的候选视频满足检索模型的检索预期。z为正数。
在一个实施例中,计算机设备还将筛选得到的候选视频发送至运行有视频检索功能的客户端。
综上所述,上述方法通过包含有基础特征网络、嵌入向量网络和量化索引网络的检索模型,可进行查询视频与m个候选视频的距离排序并筛选出排序靠前的视频,上述方法不仅实现了筛选得到的视频更接近查询视频,还避免了在确定m个候选视频时丢失或额外增加候选视频。
图14是本申请一个示例性实施例提供的检索模型的训练装置的结构框图,检索模型包括嵌入向量网络和量化索引网络,嵌入向量网络用于获取图像的特征向量,量化索引网络用于提取图像的量化索引,该检索模型的训练装置包括:
获取模块1401,用于获取用于训练检索模型的n个样本三元组;样本三元组包括训练样本、与训练样本构成相似样本对的正样本、以及与训练样本不构成相似样本对的负样本,n为大于1的正整数;
筛选模块1402,用于将n个样本三元组的基础特征向量输入嵌入向量网络;根据嵌入向量网络输出的特征向量的误差,筛选出用于训练量化索引网络的第一样本三元组集合;
筛选模块1402,还用于将n个样本三元组的基础特征向量输入量化索引网络;根据量化索引网络输出的量化索引的误差,筛选出用于训练嵌入向量网络的第二样本三元组集合;
训练模块1403,用于基于第一样本三元组集合训练量化索引网络,以及基于第二样本三元组集合训练嵌入向量网络。
在一个可选的实施例中,筛选模块1402,还用于获取嵌入向量网络对n个样本三元组输出的n组三元组特征向量。
在一个可选的实施例中,筛选模块1402,还用于计算n组三元组特征向量对应的n个第一误差损失。
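In an optional embodiment, the screening module 1402 is further configured to, in the ascending ordering of the n first error losses, screen out the sample triplets corresponding to the n_1 first error losses ranked within the first selection range and add them to the first sample triplet set used to train the quantization index network.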
在一个可选的实施例中,筛选模块1402,还用于针对每组三元组特征向量,计算训练样本的特征向量和正样本的特征向量之间的第一距离。
在一个可选的实施例中,筛选模块1402,还用于针对每组三元组特征向量,计算训练样本的特征向量和负样本的特征向量之间的第二距离。
在一个可选的实施例中,筛选模块1402,还用于计算第一距离和第二距离之间的差值与第一距离阈值之间的第一误差损失,第一距离阈值是训练样本与正样本之间的距离和训练样本与负样本之间的距离的差值的阈值。
在一个可选的实施例中,筛选模块1402,还用于在n个第一误差损失由小到大的排序结果中,根据预设的x值,筛选出排序在前x%的n 1个第一误差损失所对应的样本三元组,添加至用于训练量化索引网络的第一样本三元组集合,n 1为小于n的正整数。x为正数。
在一个可选的实施例中,筛选模块1402,还用于获取量化索引网络对n个样本三元组输出的n组三元组量化索引。
在一个可选的实施例中,筛选模块1402,还用于计算n组三元组量化索引对应的n个第二误差损失。
在一个可选的实施例中,筛选模块1402,还用于在n个第二误差损失由小到大的排序结果中,筛选出排序在第二选取范围内的n 2个第二误差损失所对应的样本三元组,添加至用于训练嵌入向量网络的第二样本三元组集合,n 2为小于n的正整数。
在一个可选的实施例中,筛选模块1402,还用于针对每组三元组量化索引,计算三元组量化索引的第一三元组损失。
在一个可选的实施例中,筛选模块1402,还用于针对每组三元组量化索引,计算三元组量化索引的第一量化误差损失。
在一个可选的实施例中,筛选模块1402,还用于对第一三元组损失和第一量化误差损失进行加权求和,得到第二误差损失。
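In an optional embodiment, the screening module 1402 is further configured to, in the ascending ordering of the n second error losses, screen out, according to a preset value y, the sample triplets corresponding to the n_2 second error losses ranked in the top y% and add them to the second sample triplet set used to train the embedding vector network. y is a positive number.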
在一个可选的实施例中,训练模块1403,还用于将第一样本三元组集合输入量化索引网络得到n 1个第三误差损失。
在一个可选的实施例中,训练模块1403,还用于基于n 1个第三误差损失,训练量化索引网络。
在一个可选的实施例中,训练模块1403,还用于针对第一样本三元组集合的每个样本三元组,计算三元组特征向量的第二三元组损失。
在一个可选的实施例中,训练模块1403,还用于针对第一样本三元组集合的每个样本三元组,计算三元组特征向量的第二量化误差损失。
在一个可选的实施例中,训练模块1403,还用于对第二三元组损失和第二量化误差损失进行加权求和,得到第三误差损失。
在一个可选的实施例中,训练模块1403,还用于将第二样本三元组集合输入嵌入向量网络得到n 2个第四误差损失。
在一个可选的实施例中,训练模块1403,还用于基于n 2个第四误差损失,训练嵌入向量网络。
在一个可选的实施例中,训练模块1403,还用于针对第二样本三元组集合的每个样本三元组,计算训练样本的特征向量和正样本的特征向量之间的第三距离。
在一个可选的实施例中,训练模块1403,还用于针对第二样本三元组集合的每个样本三元组,计算训练样本的特征向量和负样本的特征向量之间的第四距离。
在一个可选的实施例中,训练模块1403,还用于计算第三距离和第四距离之间的差值与第二距离阈值之间的第四误差损失,第二距离阈值是训练样本与正样本之间的距离和训练样本与负样本的距离的差值的阈值。
在一个可选的实施例中,检索模型还包括基础特征网络。
在一个可选的实施例中,获取模块1401,还用于通过基础特征网络,获取n个样本三元组的基础特征向量。
综上所述,上述检索模型的训练装置通过样本三元组训练嵌入向量网络和量化索引网络,不仅将通过嵌入向量网络筛选得到的第一样本三元组集合作为训练量化索引网络的样本三元组,还将通过量化索引网络筛选得到的第二样本三元组集合作为训练嵌入向量网络的样本三元组,上述检索模型的训练装置剔除了样本三元组中噪声样本三元组的影响,同时使得量化索引网络与嵌入向量网络对正负样本的预测效果相似,通过双分支预测噪声样本三元组,并使得双分支彼此学习表现优异的样本三元组,实现去噪学习,使得双分支具有相似的预测效果。
图15示出了本申请一个示例性实施例提供的对象检索装置的结构框图,该装置包括:
获取模块1501,用于获取查询对象的基础特征向量;
输入模块1502,用于将基础特征向量输入至量化索引网络和嵌入向量网络;
获取模块1501,还用于通过量化索引网络获取查询对象的量化索引,以及通过嵌入向量网络获取查询对象的特征向量;
索引模块1503,用于基于量化索引,从量化码本中索引得到m个候选对象的特征向量;量化码本存储有量化索引与m个候选对象的特征向量之间的映射关系,m为正整数;
计算模块1504,用于分别计算m个候选对象的特征向量与查询对象的特征向量的第五距离,得到m个第五距离;
筛选模块1505,用于在m个第五距离由小到大的排序结果中,根据预设的z值,筛选出排序在前z%的第五距离对应的候选对象。z为正数。
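In an optional embodiment, the obtaining module 1501 is further configured to generate the basic feature vector of the query object through the basic feature network of the retrieval model.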
综上所述,上述检索装置通过包含有基础特征网络、嵌入向量网络和量化索引网络的检索模型,可进行查询对象与m个候选对象的距离排序并筛选出排序靠前的对象,上述对象检索装置不仅实现了筛选出的对象更接近查询对象,还避免了在确定m个候选对象时丢失或额外增加候选对象。
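It should be noted that the training apparatus for the retrieval model and the object retrieval apparatus provided in the above embodiments are illustrated only by the division into the functional modules described above. In practice, the functions may be assigned to different functional modules as required, that is, the internal structure of the device may be divided into different functional modules to perform all or part of the functions described above. In addition, the training apparatus for the retrieval model belongs to the same concept as the embodiments of the training method for the retrieval model, and the object retrieval apparatus belongs to the same concept as the object retrieval method; for their specific implementation, see the method embodiments, which are not repeated here.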
图16示出了本申请一个示例性实施例提供的计算机设备1600的结构框图。该计算机设备可以为终端或服务器,在本实施例可以简单描述为终端单独训练检索模型和/或终端单独使用检索模型,或,服务器单独训练检索模型和/或服务器单独使用检索模型,或,终端和服务器共同训练检索模型和/或终端和服务器共同使用检索模型。
通常,计算机设备1600包括有:处理器1601和存储器1602。
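The processor 1601 may include one or more processing cores, for example a 4-core or 8-core processor. The processor 1601 may be implemented in at least one of the following hardware forms: DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), and PLA (Programmable Logic Array). The processor 1601 may also include a main processor and a coprocessor: the main processor, also called the CPU (Central Processing Unit), processes data in the awake state, while the coprocessor is a low-power processor that processes data in the standby state. In some embodiments, the processor 1601 may integrate a GPU (Graphics Processing Unit), which renders and draws the content to be shown on the display. In some embodiments, the processor 1601 may further include an AI processor for computing operations related to machine learning.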
存储器1602可以包括一个或多个计算机可读存储介质,该计算机可读存储介质可以是非暂态的。存储器1602还可包括高速随机存取存储器,以及非易失性存储器,比如一个或多个磁盘存储设备、闪存存储设备。在一些实施例中,存储器1602中的非暂态的计算机可读存储介质用于存储至少一个指令,该至少一个指令用于被处理器1601所执行以实现本申请中方法实施例提供的检索模型的训练方法或图像检索方法。
在一些实施例中,计算机设备1600还可选包括有:外围设备接口1603和至少一个外围设备。处理器1601、存储器1602和外围设备接口1603之间可以通过总线或信号线相连。各个外围设备可以通过总线、信号线或电路板与外围设备接口1603相连。具体地,外围设备包括:射频电路1604、显示屏1605、摄像头组件1606、音频电路1607、定位组件1608和电源1609中的至少一种。
外围设备接口1603可被用于将I/O(Input/Output,输入/输出)相关的至少一个外围设备连接到处理器1601和存储器1602。在一些实施例中,处理器1601、存储器1602和外围设备接口1603被集成在同一芯片或电路板上;在一些其他实施例中,处理器1601、存储器1602和外围设备接口1603中的任意一个或两个可以在单独的芯片或电路板上实现,本实施例对此不加以限定。
射频电路1604用于接收和发射RF(Radio Frequency,射频)信号,也称电磁信号。
显示屏1605用于显示UI(User Interface,用户界面)。
摄像头组件1606用于采集图像或视频。
音频电路1607可以包括麦克风和扬声器。
电源1609用于为计算机设备1600中的各个组件进行供电。
在一些实施例中,计算机设备1600还包括有一个或多个传感器1610。该一个或多个传感器1610包括但不限于:加速度传感器1611、陀螺仪传感器1612、压力传感器1613、指纹传感器1614、光学传感器1615以及接近传感器1616。
本领域技术人员可以理解,图16中示出的结构并不构成对计算机设备1600的限定,可以包括比图示更多或更少的组件,或者组合某些组件,或者采用不同的组件布置。
本申请还提供一种计算机可读存储介质,所述存储介质中存储有至少一条指令、至少一段程序、代码集或指令集,所述至少一条指令、所述至少一段程序、所述代码集或指令集由处理器加载并执行以实现上述方法实施例提供的检索模型的训练方法或对象检索方法。
本申请提供了一种计算机程序产品或计算机程序,该计算机程序产品或计算机程序包括计算机指令,该计算机指令存储在计算机可读存储介质中。计算机设备的处理器从计算机可读存储介质读取该计算机指令,处理器执行该计算机指令,使得该计算机设备执行上述方法实施例提供的检索模型的训练方法或对象检索方法。
在一个实施例中,上述样本三元组、基础特征网络输出的基础特征向量、嵌入向量网络输出的特征向量和量化索引网络输出的量化索引可以存储在数据共享系统的节点中。参见图17所示的数据共享系统,数据共享系统1700是指用于进行节点与节点之间数据共享的系统,该数据共享系统中可以包括多个节点1701,多个节点1701可以是指数据共享系统中各个客户端。每个节点1701在进行正常工作可以接收到输入信息,并基于接收到的输入信息维护该数据共享系统内的共享数据。为了保证数据共享系统内的信息互通,数据共享系统中的每个 节点之间可以存在信息连接,节点之间可以通过上述信息连接进行信息传输。例如,当数据共享系统中的任意节点接收到输入信息时,数据共享系统中的其他节点便根据共识算法获取该输入信息,将该输入信息作为共享数据中的数据进行存储,使得数据共享系统中全部节点上存储的数据均一致。
对于数据共享系统中的每个节点,均具有与其对应的节点标识,而且数据共享系统中的每个节点均可以存储有数据共享系统中其他节点的节点标识,以便后续根据其他节点的节点标识,将生成的区块广播至数据共享系统中的其他节点。每个节点中可维护一个如下表所示的节点标识列表,将节点名称和节点标识对应存储至该节点标识列表中。其中,节点标识可为IP(Internet Protocol,网络之间互联的协议)地址以及其他任一种能够用于标识该节点的信息,表4中仅以IP地址为例进行说明。
表4
节点名称 节点标识
节点1 117.114.151.174
节点2 117.116.189.145
节点N 119.123.789.258
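Every node in the data sharing system stores an identical blockchain. Referring to FIG. 18, the blockchain consists of multiple blocks. The genesis block includes a block header and a block body: the block header stores the input-information feature value, a version number, a timestamp, and a difficulty value, and the block body stores the input information. The block following the genesis block takes the genesis block as its parent block and likewise includes a block header and a block body; its block header stores the input-information feature value of the current block, the block-header feature value of the parent block, a version number, a timestamp, and a difficulty value. The same holds for each subsequent block, so that the block data stored in every block is linked to the block data stored in its parent block, which secures the input information in the blocks.
When a block of the blockchain is generated, referring to FIG. 19, the node holding the blockchain verifies the input information on receipt. After verification, it stores the input information in a memory pool and updates the hash tree that records the input information. It then sets the update timestamp to the time at which the input information was received, tries different random numbers, and computes the feature value repeatedly until the computed feature value satisfies the following formula: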
SHA256(SHA256(version+prev_hash+merkle_root+ntime+nbits+x))<TARGET
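where SHA256 is the feature value algorithm used to compute the feature value; version (the version number) is the version information of the relevant block protocol in the blockchain; prev_hash is the block-header feature value of the parent block of the current block; merkle_root is the feature value of the input information; ntime is the update time of the update timestamp; nbits is the current difficulty, which is constant for a period of time and is determined again once a fixed period has elapsed; x is the random number; and TARGET is the feature value threshold, which can be determined from nbits.
When a random number satisfying the above formula has been found, the information can be stored accordingly, and the block header and block body are generated to obtain the current block. The node holding the blockchain then sends the newly generated block to the other nodes of its data sharing system according to their node identifiers; the other nodes verify the newly generated block and, after verification, add it to the blockchain they store.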
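The search for a random number satisfying the formula can be sketched with the standard `hashlib` module. Several details are assumptions made for this illustration: the fields are concatenated as decimal or hexadecimal text and encoded as UTF-8, the digest is read as a big-endian integer, and the threshold is passed in directly because the text does not specify how TARGET is derived from nbits. The function names and field values are hypothetical.

```python
import hashlib


def header_hash(version, prev_hash, merkle_root, ntime, nbits, x):
    """Double SHA-256 over the concatenated header fields, as an integer."""
    # Assumption: fields are joined as text; real systems use a binary layout.
    payload = f"{version}{prev_hash}{merkle_root}{ntime}{nbits}{x}".encode("utf-8")
    digest = hashlib.sha256(hashlib.sha256(payload).digest()).digest()
    return int.from_bytes(digest, "big")


def find_nonce(version, prev_hash, merkle_root, ntime, nbits, target,
               max_tries=1_000_000):
    """Try successive values of x until the feature value falls below the target."""
    for x in range(max_tries):
        if header_hash(version, prev_hash, merkle_root, ntime, nbits, x) < target:
            return x
    return None  # no suitable x found within max_tries


# An easy, made-up target: the top 8 bits of the 256-bit digest must be zero.
easy_target = 1 << 248
nonce = find_nonce(1, "00" * 32, "ab" * 32, 1700000000, 8, easy_target)
```

A lower target makes a qualifying value rarer, which is how the difficulty controls the expected number of attempts.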

Claims (20)

  1. 一种检索模型的训练方法,其中,所述方法由计算机设备执行,所述检索模型包括嵌入向量网络和量化索引网络,所述嵌入向量网络用于获取检索对象的特征向量,所述量化索引网络用于提取所述检索对象的量化索引;所述方法包括:
    获取用于训练所述检索模型的n个样本三元组;所述样本三元组包括训练样本、与所述训练样本构成相似样本对的正样本、以及与所述训练样本不构成相似样本对的负样本,n为大于1的正整数;
    将所述n个样本三元组的基础特征向量输入所述嵌入向量网络;根据所述嵌入向量网络输出的特征向量的误差,筛选出用于训练所述量化索引网络的第一样本三元组集合;
    将所述n个样本三元组的基础特征向量输入所述量化索引网络;根据所述量化索引网络输出的量化索引的误差,筛选出用于训练所述嵌入向量网络的第二样本三元组集合;
    基于所述第一样本三元组集合训练所述量化索引网络,以及基于所述第二样本三元组集合训练所述嵌入向量网络。
  2. 根据权利要求1所述的方法,其中,所述根据所述嵌入向量网络输出的特征向量的误差,筛选出用于训练所述量化索引网络的第一样本三元组集合,包括:
    获取所述嵌入向量网络对所述n个样本三元组输出的n组三元组特征向量;
    计算所述n组三元组特征向量对应的n个第一误差损失;
    在所述n个第一误差损失由小到大的排序结果中,筛选出排序在第一选取范围内的n 1个第一误差损失所对应的样本三元组,添加至用于训练所述量化索引网络的第一样本三元组集合,n 1为小于n的正整数。
  3. 根据权利要求2所述的方法,其中,所述计算所述n组三元组特征向量对应的n个第一误差损失,包括:
    针对每组所述三元组特征向量,计算所述训练样本的特征向量和所述正样本的特征向量之间的第一距离;
    针对每组所述三元组特征向量,计算所述训练样本的特征向量和所述负样本的特征向量之间的第二距离;
    计算所述第一距离和所述第二距离之间的差值与第一距离阈值之间的第一误差损失,所述第一距离阈值是所述训练样本与所述正样本之间的距离和所述训练样本与所述负样本之间的距离的差值的阈值。
  4. 根据权利要求2所述的方法,其中,所述在所述n个第一误差损失由小到大的排序结果中,筛选出排序在第一选取范围内的n 1个第一误差损失所对应的样本三元组,添加至用于训练所述量化索引网络的第一样本三元组集合,包括:
    在所述n个第一误差损失由小到大的排序结果中,根据预设的x值,筛选出排序在前x%的n 1个第一误差损失所对应的样本三元组,添加至用于训练所述量化索引网络的第一样本三元组集合,x为正数。
  5. 根据权利要求1所述的方法,其中,所述根据所述量化索引网络输出的量化索引的误差,筛选出用于训练所述嵌入向量网络的第二样本三元组集合,包括:
    获取所述量化索引网络对所述n个样本三元组输出的n组三元组量化索引;
    计算所述n组三元组量化索引对应的n个第二误差损失;
    在所述n个第二误差损失由小到大的排序结果中,筛选出排序在第二选取范围内的n 2个第二误差损失所对应的样本三元组,添加至用于训练所述嵌入向量网络的第二样本三元组集合,n 2为小于n的正整数。
  6. 根据权利要求5所述的方法,其中,所述计算所述n组三元组量化索引对应的n个第二误差损失,包括:
    针对每组三元组量化索引,计算所述三元组量化索引的第一三元组损失;
    针对每组三元组量化索引,计算所述三元组量化索引的第一量化误差损失;
    对所述第一三元组损失和所述第一量化误差损失进行加权求和,得到所述第二误差损失。
  7. 根据权利要求5所述的方法,其中,所述在所述n个第二误差损失由小到大的排序结果中,筛选出排序在第二选取范围内的n 2个第二误差损失所对应的样本三元组,添加至用于训练所述嵌入向量网络的第二样本三元组集合,包括:
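    in the ascending ordering of the n second error losses, screening out, according to a preset value y, the sample triplets corresponding to the n_2 second error losses ranked in the top y%, and adding them to the second sample triplet set used to train the embedding vector network, y being a positive number.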
  8. 根据权利要求1至7任一所述的方法,其中,所述基于所述第一样本三元组集合训练所述量化索引网络,包括:
    针对所述第一样本三元组集合的每个所述样本三元组,通过所述量化索引网络计算三元组特征向量的第二三元组损失,所述三元组特征向量是所述嵌入向量网络输出的特征向量,所述第一样本三元组集合包括n 1个样本三元组;
    针对所述第一样本三元组集合的每个所述样本三元组,通过所述量化索引网络计算所述三元组特征向量的第二量化误差损失;
    对所述第二三元组损失和所述第二量化误差损失进行加权求和,得到所述第三误差损失;
    基于n 1个所述第三误差损失,训练所述量化索引网络。
  9. 根据权利要求1至7任一所述的方法,其中,所述基于所述第二样本三元组集合训练所述嵌入向量网络,包括:
    针对所述第二样本三元组集合的每个所述样本三元组,通过所述嵌入向量网络计算所述训练样本的特征向量和所述正样本的特征向量之间的第三距离,所述第二样本三元组集合包括n 2个样本三元组;
    针对所述第二样本三元组集合的每个所述样本三元组,通过所述嵌入向量网络计算所述训练样本的特征向量和所述负样本的特征向量之间的第四距离;
    计算所述第三距离和所述第四距离之间的差值与第二距离阈值之间的第四误差损失,所述第二距离阈值是所述训练样本与所述正样本之间的距离和所述训练样本与所述负样本的距离的差值的阈值;
    基于n 2个所述第四误差损失,训练所述嵌入向量网络。
  10. 根据权利要求1所述的方法,其中,所述检索模型还包括基础特征网络;
    所述方法还包括:
    通过所述基础特征网络,获取所述n个样本三元组的基础特征向量。
  11. 根据权利要求1至7任一项的方法,其特征在于,所述检索模型是图像检索模型,所述检索对象是图像。
  12. 一种对象检索方法,其中,所述方法由计算机设备执行,所述方法应用于权利要求1至10任一项训练得到的检索模型,所述方法包括:
    获取查询对象的基础特征向量;
    将所述基础特征向量输入至所述量化索引网络和所述嵌入向量网络;
    通过所述量化索引网络获取所述查询对象的量化索引,以及通过所述嵌入向量网络获取所述查询对象的特征向量;
    基于所述量化索引,从量化码本中索引得到m个候选对象的特征向量;所述量化码本存储有所述量化索引与所述m个候选对象的特征向量之间的映射关系,m为正整数;
    分别计算所述m个候选对象的特征向量与所述查询对象的特征向量的第五距离,得到m个第五距离;
    在所述m个第五距离由小到大的排序结果中,根据预设的z值,筛选出排序在前z%的第五距离对应的候选对象,z为正数。
  13. 一种检索模型的训练装置,其中,所述检索模型包括嵌入向量网络和量化索引网络,所述嵌入向量网络用于获取检索对象的特征向量,所述量化索引网络用于提取所述检索对象的量化索引;所述装置包括:
    获取模块,用于获取用于训练所述检索模型的n个样本三元组;所述样本三元组包括训练样本、与所述训练样本构成相似样本对的正样本、以及与所述训练样本不构成相似样本对的负样本,n为大于1的正整数;
    筛选模块,用于将所述n个样本三元组的基础特征向量输入所述嵌入向量网络;根据所述嵌入向量网络输出的特征向量,筛选出用于训练所述量化索引网络的第一样本三元组集合;
    筛选模块,还用于将所述n个样本三元组的基础特征向量输入所述量化索引网络;根据所述量化索引网络输出的量化索引,筛选出用于训练所述嵌入向量网络的第二样本三元组集合;
    训练模块,用于基于所述第一样本三元组集合训练所述量化索引网络,以及基于所述第二样本三元组集合训练所述嵌入向量网络。
  14. 根据权利要求13所述的装置,其特征在于,
    所述筛选模块,还用于获取所述嵌入向量网络对所述n个样本三元组输出的n组三元组特征向量;
    所述筛选模块,还用于计算所述n组三元组特征向量对应的n个第一误差损失;
    所述筛选模块,还用于在所述n个第一误差损失由小到大的排序结果中,筛选出排序在第一选取范围内的n 1个第一误差损失所对应的样本三元组,添加至用于训练所述量化索引网络的第一样本三元组集合,n 1为小于n的正整数。
  15. 根据权利要求14所述的装置,其特征在于,
    所述筛选模块,还用于针对每组所述三元组特征向量,计算所述训练样本的特征向量和所述正样本的特征向量之间的第一距离;
    所述筛选模块,还用于针对每组所述三元组特征向量,计算所述训练样本的特征向量和所述负样本的特征向量之间的第二距离;
    所述筛选模块,还用于计算所述第一距离和所述第二距离之间的差值与第一距离阈值之间的第一误差损失,所述第一距离阈值是所述训练样本与所述正样本之间的距离和所述训练样本与所述负样本之间的距离的差值的阈值。
  16. 根据权利要求14所述的装置,其特征在于,
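    the screening module is further configured to, in the ascending ordering of the n first error losses, screen out, according to a preset value x, the sample triplets corresponding to the n_1 first error losses ranked in the top x%, and add them to the first sample triplet set used to train the quantization index network, x being a positive number.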
  17. 一种对象检索装置,其中,所述装置应用于权利要求1至11任一项训练得到的检索模型,所述装置包括:
    获取模块,用于获取查询对象的基础特征向量;
    输入模块,用于将所述基础特征向量输入至所述量化索引网络和所述嵌入向量网络;
    获取模块,还用于通过所述量化索引网络获取所述查询对象的量化索引,以及通过所述嵌入向量网络获取所述查询对象的特征向量;
    索引模块,用于基于所述量化索引,从量化码本中索引得到m个候选对象的特征向量;所述量化码本存储有所述量化索引与所述m个候选对象的特征向量之间的映射关系,m为正整数;
    计算模块,用于分别计算所述m个候选对象的特征向量与所述查询对象的特征向量的第五距离,得到m个第五距离;
    筛选模块,用于在所述m个第五距离由小到大的排序结果中,根据预设的z值,筛选出排序在前z%的第五距离对应的候选对象,z为正数。
  18. 一种计算机设备,其中,所述计算机设备包括:处理器和存储器,所述存储器存储有计算机程序,所述计算机程序由所述处理器加载并执行以实现如权利要求1至11任一所述的检索模型的训练方法,和/或,权利要求12所述对象检索方法。
  19. 一种计算机可读存储介质,其中,所述计算机可读存储介质存储有计算机程序,所述计算机程序由处理器加载并执行以实现如权利要求1至11任一所述的检索模型的训练方法,和/或,权利要求12所述对象检索方法。
  20. 一种计算机程序产品,包括计算机程序或指令,其中,所述计算机程序或指令被处理器执行时实现如权利要求1至11任一所述的检索模型的训练方法,和/或,权利要求12所述对象检索方法。
PCT/CN2022/107973 2021-08-17 2022-07-26 检索模型的训练和检索方法、装置、设备及介质 Ceased WO2023020214A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP22857527.0A EP4386579B1 (en) 2021-08-17 2022-07-26 Retrieval model training method and apparatus, retrieval method and apparatus, device and medium
US18/134,447 US12493649B2 (en) 2021-08-17 2023-04-13 Method and apparatus for training retrieval model, retrieval method and apparatus, device and medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110945262.5 2021-08-17
CN202110945262.5A CN114282035B (zh) 2021-08-17 2021-08-17 图像检索模型的训练和检索方法、装置、设备及介质

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/134,447 Continuation US12493649B2 (en) 2021-08-17 2023-04-13 Method and apparatus for training retrieval model, retrieval method and apparatus, device and medium

Publications (1)

Publication Number Publication Date
WO2023020214A1 true WO2023020214A1 (zh) 2023-02-23

Family

ID=80868404

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/107973 Ceased WO2023020214A1 (zh) 2021-08-17 2022-07-26 检索模型的训练和检索方法、装置、设备及介质

Country Status (4)

Country Link
US (1) US12493649B2 (zh)
EP (1) EP4386579B1 (zh)
CN (1) CN114282035B (zh)
WO (1) WO2023020214A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116912535A (zh) * 2023-09-08 2023-10-20 中国海洋大学 一种基于相似筛选的无监督目标重识别方法、装置及介质
CN117113292A (zh) * 2023-08-30 2023-11-24 中国电信股份有限公司 特征量化方法、装置、非易失性存储介质及电子设备

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10425353B1 (en) 2017-01-27 2019-09-24 Triangle Ip, Inc. Machine learning temporal allocator
CN114282035B (zh) 2021-08-17 2024-12-31 腾讯科技(深圳)有限公司 图像检索模型的训练和检索方法、装置、设备及介质
CN113889146B (zh) * 2021-09-22 2025-05-27 北京小米移动软件有限公司 音频识别方法、装置、电子设备和存储介质
US12579411B2 (en) * 2021-12-29 2026-03-17 International Business Machines Corporation Resonator network based neural network
CN117152464A (zh) * 2022-05-18 2023-12-01 广州视源电子科技股份有限公司 图像检索模型的训练方法、图像检索方法和相关设备
CN115795078B (zh) * 2022-12-08 2026-04-14 北京沃东天骏信息技术有限公司 图像检索模型的训练方法、图像检索方法及装置
CN116259328B (zh) * 2023-02-24 2025-10-31 思必驰科技股份有限公司 用于音频降噪的后训练量化方法、装置和存储介质
CN117133031A (zh) * 2023-07-24 2023-11-28 厦门真景科技有限公司 一种姿态鲁棒的人脸识别方法和装置以及设备

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180341805A1 (en) * 2015-11-06 2018-11-29 Thomson Licensing Method and Apparatus for Generating Codebooks for Efficient Search
CN109815355A (zh) * 2019-01-28 2019-05-28 网易(杭州)网络有限公司 图像搜索方法及装置、存储介质、电子设备
CN113127672A (zh) * 2021-04-21 2021-07-16 鹏城实验室 量化图像检索模型的生成方法、检索方法、介质及终端
CN113254687A (zh) * 2021-06-28 2021-08-13 腾讯科技(深圳)有限公司 图像检索、图像量化模型训练方法、装置和存储介质
CN114282035A (zh) * 2021-08-17 2022-04-05 腾讯科技(深圳)有限公司 图像检索模型的训练和检索方法、装置、设备及介质

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106021364B (zh) * 2016-05-10 2017-12-12 百度在线网络技术(北京)有限公司 图片搜索相关性预测模型的建立、图片搜索方法和装置
CN106909625A (zh) * 2017-01-20 2017-06-30 清华大学 一种基于Siamese网络的图像检索方法及系统
CN106980641B (zh) * 2017-02-09 2020-01-21 上海媒智科技有限公司 基于卷积神经网络的无监督哈希快速图片检索系统及方法
CN107423376B (zh) * 2017-07-10 2019-12-27 上海媒智科技有限公司 一种有监督深度哈希快速图片检索方法及系统
CN111177435B (zh) * 2019-12-31 2023-03-31 重庆邮电大学 一种基于改进pq算法的cbir方法
CN111782834A (zh) 2020-06-28 2020-10-16 北京百度网讯科技有限公司 图像检索的方法、装置、设备及计算机可读存储介质
CN116051917B (zh) * 2021-10-28 2024-10-18 腾讯科技(深圳)有限公司 一种训练图像量化模型的方法、检索图像的方法及装置
CN116527863A (zh) * 2022-04-28 2023-08-01 腾讯科技(深圳)有限公司 基于虚拟现实的视频生成方法、装置、设备及介质
CN118037956A (zh) * 2024-02-07 2024-05-14 北京清影机器视觉技术有限公司 固定空间内三维虚拟实景生成方法及系统

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4386579A4 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
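CN117113292A (zh) * 2023-08-30 2023-11-24 中国电信股份有限公司 Feature quantization method and apparatus, non-volatile storage medium and electronic device
CN116912535A (zh) * 2023-09-08 2023-10-20 中国海洋大学 Unsupervised object re-identification method, apparatus and medium based on similarity screening
CN116912535B (zh) * 2023-09-08 2023-11-28 中国海洋大学 Unsupervised object re-identification method, apparatus and medium based on similarity screening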

Also Published As

Publication number Publication date
CN114282035B (zh) 2024-12-31
EP4386579A1 (en) 2024-06-19
CN114282035A (zh) 2022-04-05
EP4386579B1 (en) 2026-04-22
US12493649B2 (en) 2025-12-09
EP4386579A4 (en) 2024-11-20
US20230252070A1 (en) 2023-08-10

Similar Documents

Publication Publication Date Title
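WO2023020214A1 (zh) Training and retrieval method, apparatus, device and medium for retrieval model
CN113095370B (zh) Image recognition method and apparatus, electronic device and storage medium
CN114565807B (zh) Method and apparatus for training target image retrieval model
WO2020094060A1 (zh) Recommendation method and apparatus
WO2021159769A1 (zh) Image retrieval method, apparatus, storage medium and device
CN112231592B (zh) Graph-based network community discovery method, apparatus, device and storage medium
EP4664416A1 (en) Training method and apparatus for liveness detection model, and medium, electronic device and product
CN111339443A (zh) User tag determination method and apparatus, computer device and storage medium
CN111709473B (zh) Clustering method and apparatus for object features
CN110909817B (zh) Distributed clustering method and system, processor, electronic device and storage medium
WO2023108995A1 (zh) Vector similarity calculation method, apparatus, device and storage medium
CN114358109A (zh) Feature extraction model training and sample retrieval method, apparatus and computer device
CN113704528B (zh) Cluster center determination method, apparatus and device, and computer storage medium
WO2021139351A1 (zh) Image segmentation method, apparatus, medium and electronic device
CN113240128B (zh) Collaborative training method and apparatus for imbalanced data, electronic device and storage medium
CN111767419A (zh) Picture search method, apparatus, device and computer-readable storage medium
WO2024066927A1 (zh) Training method, apparatus and device for image classification model
CN118114123B (zh) Processing method and apparatus for recognition model, computer device and storage medium
CN118097293A (zh) Few-shot data classification method and system based on residual graph convolutional network and self-attention
CN114821657A (zh) Model training method, pedestrian re-identification method and apparatus, device, and medium
CN119415721B (zh) Cross-modal retrieval method and apparatus, electronic device, and storage medium
CN113011580B (zh) Method for processing embedding representation and related device
CN120470541B (zh) Multi-source data fusion method and system based on optimal transport algorithm
CN111340082B (zh) Data processing method and apparatus, processor, electronic device, and storage medium
HK40098956B (zh) Training method, apparatus, device and storage medium for feature extraction model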

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application. Ref document number: 22857527; Country of ref document: EP; Kind code of ref document: A1
ENP Entry into the national phase. Ref document number: 2022857527; Country of ref document: EP; Effective date: 20240312
NENP Non-entry into the national phase. Ref country code: DE