WO2020228525A1 - Method and apparatus for location recognition and model training thereof, and electronic device
- Publication number
- WO2020228525A1 (PCT/CN2020/087308)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- image
- model
- cnn model
- vector
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
- G06V10/443—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
- G06V10/449—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
- G06V10/451—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
- G06V10/454—Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/58—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/583—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/58—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/5866—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, manually generated location and time information
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/58—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/587—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using geographical or spatial information, e.g. location
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
- G06F18/2134—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on separation criteria, e.g. independent component analysis
- G06F18/21343—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on separation criteria, e.g. independent component analysis using decorrelation or non-stationarity, e.g. minimising lagged cross-correlations
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
- G06F18/2135—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on approximation criteria, e.g. principal component analysis
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06F18/2148—Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the process organisation or structure, e.g. boosting cascade
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
- G06N20/10—Machine learning using kernel methods, e.g. support vector machines [SVM]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0495—Quantised networks; Sparse networks; Compressed networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/09—Supervised learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/7715—Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
- G06V10/7747—Organisation of the process, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
Definitions
- This application relates to the field of computer technology, and in particular to a method and device for location recognition and model training thereof, a computer-readable storage medium, and an electronic device.
- Location recognition (place recognition) is increasingly widely used.
- In a map application, location recognition can determine whether images show the same place, so that location and positioning errors introduced during map creation can be corrected.
- In video processing, image segments can be classified through location recognition, and the video can be summarized and segmented accordingly to extract its highlights.
- Location recognition can also support the Augmented Reality (AR) functions of various mobile terminal applications. When a user shoots a scene with a mobile terminal, the name of the scene can be determined through location recognition, and the corresponding introduction and AR browsing functions can then be provided.
- Location recognition mainly faces three challenges: condition changes, viewpoint changes, and efficiency requirements. To cope with these difficulties, the industry has developed three types of implementation methods.
- The first type of method extracts features from location images using manually designed descriptors; this type of method is robust to viewpoint changes, but cannot adapt automatically to changes in the application scenario.
- The second type of method uses a pre-trained Convolutional Neural Network (CNN) as the feature extractor for location images; compared with the first type, it is more robust to condition changes, but because the CNN model was originally pre-trained for other tasks, the performance improvement is limited.
- The third type of method trains the model directly with location recognition as the training target.
- This type of algorithm significantly improves the robustness of location recognition to changes in conditions and viewpoint.
- However, the image features obtained usually have high dimensionality, so the computational cost is high and it is often difficult to meet the efficiency requirements of location recognition.
- This application provides a method and device for location recognition and its model training, computer-readable storage medium, and electronic equipment.
- A model training method for location recognition includes: extracting local features of a sample image based on a first part of a CNN model, the sample image including at least one group of multiple images taken at the same place; aggregating the local features into a feature vector having a first dimension based on a second part of the CNN model; obtaining a compressed representation vector of the feature vector based on a third part of the CNN model, the compressed representation vector having a second dimension smaller than the first dimension; and, with the goal of minimizing the distance between the compressed representation vectors corresponding to the multiple images taken at the same place, adjusting the model parameters of the first part, the second part, and the third part until a CNN model that satisfies a preset condition is obtained.
- A location recognition method includes: extracting a compressed representation vector from a collected image using a CNN model, the CNN model being trained according to the above-mentioned model training method for location recognition; and performing location recognition based on the extracted compressed representation vector.
- A model training device for location recognition includes: a feature extraction module configured to extract local features of a sample image based on the first part of a CNN model, the sample image including at least one group of multiple images taken at the same place; a feature aggregation module configured to aggregate the local features into a feature vector having a first dimension based on the second part of the CNN model; a feature compression module configured to obtain a compressed representation vector of the feature vector based on the third part of the CNN model, the compressed representation vector having a second dimension smaller than the first dimension; and a model training module configured to adjust the model parameters of the first part, the second part, and the third part, with the goal of minimizing the distance between the compressed representation vectors corresponding to the multiple images taken at the same place, until a CNN model that satisfies a preset condition is obtained.
- A location recognition device includes: an extraction module configured to extract a compressed representation vector from a collected image using a CNN model, the CNN model being trained according to the above-mentioned model training method for location recognition; and a recognition module configured to perform location recognition based on the extracted compressed representation vector.
- A computer-readable storage medium stores a computer program; when the computer program is executed by a processor, the above-mentioned model training method for location recognition or the above-mentioned location recognition method is implemented.
- An electronic device includes: a processor; and a memory storing computer-readable instructions which, when executed by the processor, implement the model training method for location recognition or the location recognition method described above.
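The training goal stated above, minimizing the distance between compressed representation vectors of images taken at the same place, can be illustrated with a minimal sketch. The function names, the choice of Euclidean distance, and the sample vectors are assumptions made for illustration only; the text here does not fix a particular distance or loss formula.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def same_place_distance(vectors):
    """Mean pairwise distance among the compressed vectors of one same-place group."""
    pairs = [(i, j) for i in range(len(vectors)) for j in range(i + 1, len(vectors))]
    return sum(euclidean(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)

# Hypothetical 2-dimensional compressed vectors for three images of one place.
group = [[0.0, 0.0], [3.0, 4.0], [0.0, 0.0]]
loss = same_place_distance(group)  # smaller means the group is more tightly clustered
```

A quantity of this kind only covers the same-place term. Minimized on its own it is trivially satisfied by mapping every image to the same vector, so a practical objective would normally also keep images of different places apart.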
- Fig. 1 shows a schematic diagram of an exemplary system architecture to which the model training method or device, or the location recognition method or device, of the embodiments of the present application can be applied.
- Fig. 2 shows a schematic structural diagram of a computer system suitable for implementing an electronic device according to an embodiment of the present application.
- Fig. 3 is a flow chart showing a model training method for location recognition according to an exemplary embodiment.
- Fig. 4 is a flowchart showing a model training method for location recognition according to another exemplary embodiment.
- FIG. 5 is a schematic diagram of the basic network structure of the embodiment shown in FIG. 4.
- FIG. 6 is a schematic flowchart of step 490 in the embodiment shown in FIG. 4.
- Figs. 7-8 exemplarily show a performance comparison between the location recognition model in the embodiments of the present application and related technologies.
- Fig. 9 is a flow chart showing a method for location identification according to an exemplary embodiment.
- FIG. 10 is a schematic implementation scene diagram of step 920 in the embodiment shown in FIG. 9.
- Fig. 11 is a block diagram showing a model training device for location recognition according to an exemplary embodiment.
- Fig. 12 is a block diagram showing a location identification device according to an exemplary embodiment.
- FIG. 1 shows a schematic diagram of an exemplary system architecture 100 of a model training method or device for location recognition, or a location recognition method or device to which an embodiment of the present application can be applied.
- the system architecture 100 may include one or more of terminal devices 101, 102, 103, a network 104 and a server 105.
- the network 104 is used to provide a medium for communication links between the terminal devices 101, 102, 103 and the server 105.
- the network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables.
- the numbers of terminal devices, networks, and servers in FIG. 1 are merely illustrative. According to implementation needs, there can be any number of terminal devices, networks and servers.
- the server 105 may be a server cluster composed of multiple servers.
- the user can use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104 to receive or send messages and so on.
- the terminal devices 101, 102, and 103 may be various electronic devices with display screens, including but not limited to smart phones, tablet computers, portable computers, desktop computers, and so on.
- the server 105 may be a server that provides various services.
- the user uploads a sample image sequence to the server 105 by using the terminal device 103 (or the terminal device 101 or 102).
- The sample image sequence includes at least one group of multiple images taken at the same place. Based on the sample image sequence, the server 105 can extract the local features of the sample images based on the first part of the CNN model; aggregate the local features into a feature vector having the first dimension based on the second part of the CNN model; obtain a compressed representation vector of the feature vector based on the third part of the CNN model, where the compressed representation vector has a second dimension smaller than the first dimension; and, with the goal of minimizing the distance between the compressed representation vectors corresponding to the multiple images, adjust the model parameters of the first part, the second part, and the third part until a CNN model that satisfies a preset condition is obtained.
- The user uses the terminal device 101 (or terminal device 102 or 103) to take an image at a certain place and uploads it to the server 105; the server 105 uses the aforementioned trained CNN model to extract a compressed representation vector from the image and performs location recognition based on the extracted compressed representation vector.
- The model training method or location recognition method provided by the embodiments of the present application is generally executed by the server 105; accordingly, the model training device or location recognition device is generally provided in the server 105. In other embodiments, some terminals may have functions similar to those of a server and can execute the method. Therefore, the model training method and the location recognition method provided in the embodiments of the present application are not limited to execution on the server side.
- Fig. 2 shows a schematic structural diagram of a computer system suitable for implementing an electronic device according to an embodiment of the present application.
- The computer system 200 includes a central processing unit (CPU) 201, which can perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 202 or a program loaded from a storage section 208 into a random access memory (RAM) 203.
- The RAM 203 also stores various programs and data required for system operation.
- the CPU 201, ROM 202, and RAM 203 are connected to each other through a bus 204.
- An input/output (I/O) interface 205 is also connected to the bus 204.
- The following components are connected to the I/O interface 205: an input section 206 including a keyboard, a mouse, etc.; an output section 207 including a cathode ray tube (CRT) or liquid crystal display (LCD), a speaker, etc.; a storage section 208 including a hard disk, etc.; and a communication section 209 including a network interface card such as a LAN card, a modem, etc.
- the communication section 209 performs communication processing via a network such as the Internet.
- the drive 210 is also connected to the I/O interface 205 as needed.
- a removable medium 211 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, etc., is installed on the drive 210 as needed, so that the computer program read from it is installed into the storage section 208 as needed.
- the process described below with reference to the flowchart can be implemented as a computer software program.
- the embodiments of the present application include a computer program product, which includes a computer program carried on a computer-readable medium, and the computer program contains program code for executing the method shown in the flowchart.
- the computer program may be downloaded and installed from the network through the communication section 209, and/or installed from the removable medium 211.
- When the computer program is executed by the central processing unit (CPU) 201, the various functions defined in the system of the present application are performed.
- the computer-readable medium shown in this application may be a computer-readable signal medium or a computer-readable storage medium or any combination of the two.
- The computer-readable storage medium may be, for example, but is not limited to, an electric, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of computer-readable storage media include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
- the computer-readable storage medium may be any tangible medium that contains or stores a program, and the program may be used by or in combination with an instruction execution system, apparatus, or device.
- a computer-readable signal medium may include a data signal propagated in a baseband or as a part of a carrier wave, and a computer-readable program code is carried therein. This propagated data signal can take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
- the computer-readable signal medium may also be any computer-readable medium other than the computer-readable storage medium.
- The computer-readable medium may send, propagate, or transmit the program for use by or in combination with the instruction execution system, apparatus, or device.
- the program code contained on the computer-readable medium can be transmitted by any suitable medium, including but not limited to: wireless, electric wire, optical cable, RF, etc., or any suitable combination of the above.
- Each block in the flowchart or block diagram may represent a module, program segment, or part of code, and the module, program segment, or part of code contains one or more executable instructions for realizing the specified logical function.
- the functions marked in the block may also occur in a different order from the order marked in the drawings. For example, two blocks shown in succession can actually be executed substantially in parallel, and they can sometimes be executed in the reverse order, depending on the functions involved.
- Each block in the block diagram or flowchart, and any combination of blocks in the block diagram or flowchart, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
- the units involved in the embodiments described in the present application can be implemented in software or hardware, and the described units can also be provided in a processor. Among them, the names of these units do not constitute a limitation on the unit itself under certain circumstances.
- this application also provides a computer-readable medium.
- The computer-readable medium may be included in the electronic device described in the above-mentioned embodiments, or it may exist alone without being assembled into the electronic device.
- The above-mentioned computer-readable medium carries one or more programs.
- When the one or more programs are executed by an electronic device, the electronic device implements the method described in the following embodiments. For example, the electronic device can implement the steps shown in FIGS. 3 to 6.
- Artificial intelligence is a theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
- artificial intelligence is a comprehensive technology of computer science, which attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a similar way to human intelligence.
- Artificial intelligence is to study the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
- Artificial intelligence technology is a comprehensive discipline, covering a wide range of fields, including both hardware-level technologies and software-level technologies.
- Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, and mechatronics.
- Artificial intelligence software technology mainly includes computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
- Computer vision is a science that studies how to make machines "see". More specifically, it refers to using cameras and computers instead of human eyes to recognize, track, and measure objects, and to further process the resulting images so that they become more suitable for human observation or for transmission to instruments for inspection.
- Computer vision technology usually includes image processing, image recognition, image semantic understanding, image retrieval, Optical Character Recognition (OCR), video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, and simultaneous localization and mapping, as well as common biometric recognition technologies such as facial recognition and fingerprint recognition.
- Machine learning is an interdisciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It studies how computers can simulate or realize human learning behavior in order to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve their own performance.
- Machine learning is the core of artificial intelligence, the fundamental way to make computers intelligent, and its applications cover all fields of artificial intelligence.
- Machine learning and deep learning usually include artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and teaching learning techniques.
- CNN is a multi-layer supervised learning neural network, commonly used to deal with image-related machine learning problems.
- A typical CNN consists of convolutional layers, pooling layers, and fully connected layers.
- The lower hidden layers generally consist of alternating convolutional and pooling layers.
- The role of the convolutional layer is to enhance the original signal features of the image and reduce noise through convolution operations.
- The role of the pooling layer is to reduce the amount of computation, based on the principle of local image correlation, while maintaining rotation invariance of the image.
- The fully connected layers are located in the upper part of the CNN.
- Their input is the feature map obtained through feature extraction by the convolutional and pooling layers, and their output can be connected to a classifier that classifies the input image using logistic regression, Softmax regression, or a Support Vector Machine (SVM).
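The convolution and pooling operations described above can be sketched in plain Python. This is an illustration under stated assumptions rather than the model of this application: the helper names, the 5x5 image, and the 2x2 edge kernel are hypothetical, and the "convolution" is the sliding-window cross-correlation commonly used in CNN implementations.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j] for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

def max_pool2d(fmap, size=2):
    """Keep the maximum of each non-overlapping size x size window."""
    return [
        [
            max(fmap[r + i][c + j] for i in range(size) for j in range(size))
            for c in range(0, len(fmap[0]) - size + 1, size)
        ]
        for r in range(0, len(fmap) - size + 1, size)
    ]

# Hypothetical 5x5 image with a vertical edge between columns 1 and 2.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hypothetical kernel that responds to left-to-right intensity increases.
kernel = [[-1, 1], [-1, 1]]

feature_map = conv2d_valid(image, kernel)  # 4x4 map, strong response at the edge
pooled = max_pool2d(feature_map)           # 2x2 map, four times fewer values
```

The convolution highlights the edge while flat regions produce zero response, and pooling keeps the strongest local response while shrinking the map that later layers must process.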
- the CNN training process generally uses the gradient descent method to minimize the loss function.
- the training sample set of CNN usually consists of vector pairs in the form of "input vector, ideal output vector".
- The weight parameters of all layers of the network can be initialized with different small random numbers. Since a CNN is essentially an input-to-output mapping, it can learn a large number of mapping relationships between inputs and outputs without requiring any precise mathematical expression relating them. A training sample set composed of known vector pairs can therefore be used to train the CNN so that it acquires the ability to map inputs to outputs.
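As a minimal sketch of gradient descent on "input, ideal output" pairs, the following fits a one-parameter-pair linear model rather than a CNN. The model, learning rate, epoch count, and sample pairs are all assumptions for illustration, and the initial weights are fixed small values instead of random ones so that the run is reproducible.

```python
def train_linear(pairs, lr=0.1, epochs=500):
    """Fit y = w * x + b by gradient descent on the mean squared error."""
    w, b = 0.01, -0.01  # small initial values (fixed here; the text uses small random numbers)
    n = len(pairs)
    for _ in range(epochs):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in pairs) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in pairs) / n
        # Step against the gradient to reduce the loss.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Hypothetical training pairs generated from y = 2x + 1.
pairs = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]
w, b = train_linear(pairs)
```

The model recovers the mapping from the pairs alone, without being told the underlying expression, which is the property the paragraph above attributes to CNN training.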
- Location recognition is often used for loop detection and image-based localization in Simultaneous Localization and Mapping (SLAM).
- Pose estimation is often a recursive process, that is, the pose of the current frame is calculated from the pose of the previous frame, so this frame-by-frame propagation inevitably accumulates error.
- The key to loop detection is how to effectively detect that the camera has passed through the same place, which determines the correctness of the estimated trajectory and map over a long period of time. Since loop detection provides the correlation between the current data and all historical data, it can greatly reduce the cumulative error generated by the SLAM front end and build a geometrically consistent map. In loop detection, location recognition serves to identify whether the camera has returned to the same place. Since loop detection corrects the cumulative error of the visual SLAM front end, it can be applied in AR-related applications to correct the inaccurate poses and loss of positioning caused by long-term operation of visual odometry.
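The accumulation of error in recursive pose estimation, and its correction once a revisited place is recognized, can be illustrated with a one-dimensional toy. Everything in this sketch is an assumption for illustration: the constant per-step bias, the out-and-back trajectory, and the linear spreading of the loop residual, which is a deliberately simple stand-in for the optimization a real SLAM back end would perform.

```python
def integrate(steps, start=0.0):
    """Recursive pose estimate: pose[k] = pose[k - 1] + step[k]."""
    poses = [start]
    for s in steps:
        poses.append(poses[-1] + s)
    return poses

def close_loop(poses, anchor_index, revisit_index):
    """Spread the loop residual linearly over the poses between anchor and revisit."""
    residual = poses[revisit_index] - poses[anchor_index]
    span = revisit_index - anchor_index
    corrected = list(poses)
    for k in range(anchor_index, revisit_index + 1):
        corrected[k] -= residual * (k - anchor_index) / span
    return corrected

# Hypothetical camera path: 5 unit steps forward, then 5 unit steps back to the start.
# Each measured step carries a bias of +0.02.
measured = [1.02] * 5 + [-0.98] * 5
drifted = integrate(measured)        # final pose drifts away from the true value 0.0
# Location recognition reports that frame 10 shows the same place as frame 0.
corrected = close_loop(drifted, anchor_index=0, revisit_index=10)
```

Without the loop constraint the estimate ends about 0.2 units from the start even though the camera has returned to it; with the constraint, the end point and the intermediate poses are pulled back toward their true values.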
- Image-based positioning obtains the corresponding geographic location from an image, and its application scenarios are also very broad. For example, a picture taken by a terminal can be uploaded to an image database or search engine annotated with geographic locations, and a high-precision geographic location of the photographer can be obtained through location recognition technology. Image-based positioning can be useful, for example, in places with weak GPS signals or complex terrain, where the positioning of a mobile phone will inevitably deviate; in this case, the user can take a picture of the current location with the phone and use location recognition technology to obtain accurate positioning.
- the purpose of location recognition is to identify the spatial location corresponding to the query image.
- location recognition uses an image feature extractor to project the images into a feature space, and then calculates the similarity between the image features of the image to be queried and those of the sample images in the database. If the similarity between the query image and the most similar image in the database satisfies a certain threshold, the location of that database image is considered to be the location of the image to be queried. Therefore, the most critical part of location recognition is obtaining an appropriate image feature extractor.
- the construction of the image feature extractor is usually modeled as an instance retrieval problem, which mainly includes three steps. First, extract the local descriptors of the image; then, aggregate the local descriptors into feature vectors with fixed dimensions; finally, compress the feature vectors to appropriate dimensions.
- the training-based location recognition algorithms in related technologies are only trained for the first two steps, and the final feature compression step is only used as a post-processing process after the model training is completed.
- the feature dimension of the image output by the model is very high, which will cause two problems.
- the second is that directly using high-dimensional image features to calculate the similarity between images is computationally too expensive.
- the embodiments of the present application provide a method and device for location recognition and model training thereof, a computer-readable storage medium, and electronic equipment.
- Fig. 3 is a flow chart showing a model training method for location recognition according to an exemplary embodiment. As shown in Fig. 3, the model training method can be executed by any computing device, and can include the following steps 310-370.
- in step 310, the local features of the sample image are extracted based on the first part of the CNN model.
- the sample image here includes at least one group of multiple images taken at the same place.
- the sample image includes multiple groups of images taken at different locations, and each group includes multiple images taken at the same location.
- the purpose of location recognition is to identify the spatial location corresponding to the query image. Therefore, the sample images used to train the model may carry marked geographic location information, such as GPS information.
- for a given image, the images in the sample image taken at the same shooting location may be marked as positive samples, and the images taken at different shooting locations may be marked as negative samples.
- the training process of the location recognition model continuously adjusts the model parameters so that, for each sample image, the final model minimizes the distance between its vector representation and the vector representations of its positive samples, while the distance to the vector representations of its negative samples satisfies a preset margin.
- the same and different shooting locations described in the embodiments of the present application are used only for ease of description, and do not mean that the location information of the images is completely identical.
- the same shooting location means that the difference between the geographic location information (such as GPS information) of the two images is less than a first preset value, and a different shooting location means that the difference between the geographic location information (such as GPS information) of the two images is greater than a second preset value.
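- The two-threshold rule above can be sketched as follows. This is a minimal illustration, not the application's implementation: the function names, the use of the haversine distance, and the 10 m and 25 m thresholds are all assumptions chosen for the example.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    r = 6371000.0  # mean Earth radius in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def label_pair(gps_a, gps_b, same_max_m=10.0, diff_min_m=25.0):
    """Label an image pair from its GPS tags.

    same_max_m plays the role of the first preset value and diff_min_m the
    second preset value; both numbers are hypothetical.
    Returns 'positive', 'negative', or None when the pair falls between the
    two thresholds and is used as neither.
    """
    d = haversine_m(*gps_a, *gps_b)
    if d < same_max_m:
        return "positive"
    if d > diff_min_m:
        return "negative"
    return None
```

  Keeping a gap between the two thresholds leaves ambiguous pairs unlabeled, so GPS noise is less likely to turn a true match into a negative sample.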
- the feature extraction here is the primary operation in image processing.
- feature extraction is the first arithmetic processing performed on an image, which is mainly used to determine whether each pixel represents a feature.
- the input sample image can also be smoothed in the scale space through the Gaussian blur kernel, and then one or more features of the image are calculated through the local derivative operation.
- local features are, broadly, regions of the image that differ from their surroundings. Local features usually describe specific areas in an image, which makes the image highly distinguishable. Therefore, the above-mentioned feature extraction essentially extracts the local features of the image, and the result directly determines the performance of subsequent image classification and recognition.
- the first type of location recognition implementation mentioned in the background technology section is to extract features from images based on manually designed descriptors.
- a part of the trainable CNN model is used to implement feature extraction of the sample image to obtain a local descriptor of the sample image.
- in step 330, the local features are aggregated into a feature vector with the first dimension based on the second part of the CNN model.
- the image to be queried needs to be compared with the massive number of images in the database.
- the local descriptors obtained in step 310 are already a feature representation of the sample image. However, even if each descriptor needs only a few bits, given the number of descriptors per image and the number of images in the database, it is difficult to recognize the location of the image to be queried in a short enough time directly from the local descriptors. Therefore, in the embodiment of the present application, the local descriptors are aggregated in step 330, with the goal of aggregating them into a vector of a specific dimension.
- BoW Bag-of-words
- FV Fisher Vector
- VLAD Vector of Locally Aggregated Descriptors
- the core idea of the BoW method is to extract key point descriptors and train a codebook by clustering, and then represent the picture by how many descriptor vectors fall on each center vector in the codebook.
- the core idea of the FV method is to use a Gaussian mixture model and represent each image by calculating the mean, covariance and other parameters of the model.
- VLAD is a descriptor pooling method widely used in instance retrieval and image classification. It captures statistical information about the local features aggregated in an image. Unlike BoW, which records the number of occurrences of vectors, VLAD records the sum of the residuals of the descriptor vectors. The following uses VLAD as an example to describe the general process of local feature aggregation.
- V is a K×D-dimensional image description vector.
- V is first computed as a K×D matrix; the matrix is then converted into a vector representation and normalized. The calculation formula is as follows:
  $$V(j,k) = \sum_{i=1}^{N} a_k(x_i)\,\bigl(x_i(j) - c_k(j)\bigr) \tag{1}$$
  where N is the number of local descriptors of the image.
- $x_i(j)$ and $c_k(j)$ represent the j-th feature value of the i-th local descriptor and of the k-th cluster center, respectively.
- $a_k(x_i)$ can be understood as the weight with which the i-th local feature belongs to the k-th cluster; in other words, $a_k(x_i)$ is 1 if the feature belongs to this cluster, and 0 if it does not.
- $V(j,k)$ represents the sum of the residuals $(x_i - c_k)$ of all local features on each cluster.
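- The aggregation just described can be sketched in NumPy as below. The sketch assumes hard assignment to the nearest cluster center and a single L2 normalization of the flattened matrix; the function name `vlad` and the toy data are illustrative only.

```python
import numpy as np

def vlad(descriptors, centers):
    """Aggregate N local descriptors (N, D) over K cluster centers (K, D).

    Returns the flattened, L2-normalized K*D vector.
    """
    n = descriptors.shape[0]
    k = centers.shape[0]
    # Squared distance from every descriptor to every center, shape (N, K).
    d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)
    # Hard assignment a_k(x_i): 1 for the nearest center, 0 otherwise.
    a = np.zeros((n, k))
    a[np.arange(n), nearest] = 1.0
    # V[k, j] = sum_i a_k(x_i) * (x_i(j) - c_k(j))
    v = a.T @ descriptors - a.sum(axis=0)[:, None] * centers
    v = v.reshape(-1)
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

# Toy data: three 2-D descriptors and two centers.
toy_x = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 5.0]])
toy_c = np.array([[2.0, 0.0], [0.0, 4.0]])
toy_v = vlad(toy_x, toy_c)
```

  In the toy data the two descriptors around the first center have residuals that cancel, so only the residual on the second center survives in the output.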
- step 330 uses a part of a trainable CNN model to implement local feature aggregation, and aggregates the local features obtained in step 310 into a feature vector having the first dimension.
- in step 350, the compressed representation vector of the feature vector is obtained based on the third part of the CNN model.
- Compression here means that the vector has a second dimension smaller than the first dimension.
- the feature vector obtained by local feature aggregation in step 330 usually still has a relatively high dimensionality, which makes it difficult to meet the efficiency requirements of location recognition and makes it easy to fall into the curse of dimensionality, reducing the generalization performance of the image features.
- the embodiment of the present application performs dimensionality reduction processing on the above-mentioned feature vector in step 350.
- dimensionality reduction is a preprocessing method for high-dimensional feature data.
- its purpose is to remove noise and unimportant features from high-dimensional data while retaining the most important features, so as to improve data processing speed.
- dimensionality reduction can save a lot of processing time while keeping information loss within a certain range.
- Related dimensionality reduction algorithms include Singular Value Decomposition (SVD), Principal Component Analysis (PCA), Factor Analysis (FA), Independent Component Analysis (ICA) and so on.
- when a trainable VLAD network is used for local feature aggregation as described above, it outputs, for each image, an image description vector of K×D dimensions (that is, the first dimension).
- the feature representation matrix corresponding to the entire image set is $X \in \mathbb{R}^{(K \times D) \times M}$.
- the goal of PCA is to obtain a compressed representation vector of dimension L (that is, the second dimension).
- the first L principal components corresponding to X can be expressed as a matrix $T \in \mathbb{R}^{(K \times D) \times L}$; finally, multiplying the transpose $T^T$ of the matrix T by the matrix X yields the compressed representation of the entire image set, $Y \in \mathbb{R}^{L \times M}$.
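- As a rough illustration of this conventional PCA step, the sketch below computes T and Y with NumPy. It assumes that each column of X is one image and that the principal directions are estimated from mean-centered data through a singular value decomposition; the function name `pca_compress` is hypothetical.

```python
import numpy as np

def pca_compress(x, out_dim):
    """Conventional PCA compression.

    x:       (K*D, M) matrix, one column per image.
    out_dim: target dimension L, at most min(K*D, M).
    Returns (T, Y) with T of shape (K*D, L) and Y = T^T X of shape (L, M).
    """
    # Centre each feature before estimating the principal directions.
    xc = x - x.mean(axis=1, keepdims=True)
    # The left singular vectors of the centred data are the principal
    # directions, ordered by decreasing variance.
    u, _, _ = np.linalg.svd(xc, full_matrices=False)
    t = u[:, :out_dim]
    # As in the text, the compressed representation is T^T X.
    y = t.T @ x
    return t, y

rng = np.random.default_rng(0)
features = rng.standard_normal((8, 20))   # toy stand-in for K*D = 8, M = 20
t_mat, y_mat = pca_compress(features, 3)
```

  The columns of T are orthonormal, which is the property that the trainable compression layer described later is constrained to reproduce.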
- step 350 uses a part of the trainable CNN model to implement dimensionality reduction processing, and reduces the dimensionality of the feature vector obtained in step 330 to a compressed representation vector with a smaller dimension.
- in step 370, the model parameters of the first part, the second part, and the third part of the CNN model are adjusted, with the goal of minimizing the distance between the compressed representation vectors corresponding to multiple images taken at the same location, until a CNN model that meets the preset conditions is obtained.
- the CNN model here can adopt a typical network structure.
- the first part, the second part and the third part mentioned above may each include one or more of a convolutional layer, a pooling layer, a fully connected layer, and a Softmax layer.
- the first part may include a convolutional layer and a pooling layer
- the second part may include a Softmax layer
- the third part may include a fully connected layer.
- the sample image of step 310 is input into the model, and the corresponding compressed representation vector can be extracted through steps 330, 350 and 370; the joint loss calculated based on the Softmax layer is backpropagated through the model to update the parameters of the convolutional layer and the fully connected layer; the sample image is then re-input into the model with updated parameters, and this is iterated until the preset convergence condition is met, at which point a well-trained CNN model is obtained.
- in this way, a location recognition model trained end to end can be truly realized, and the resulting CNN model can directly obtain low-dimensional image features, thereby improving the performance of location recognition.
- Fig. 4 is a flowchart showing a model training method for location recognition according to another exemplary embodiment. As shown in Fig. 4, the model training method can be executed by any computing device, and can include the following steps 410-490.
- in step 410, a sample image set is constructed.
- step 410 may use a public image database to construct a sample image set, for example, including but not limited to Pitts250k, Pitts30k, TokyoTM and so on.
- Pitts250k contains 250k database images collected from Google Street View and 24k query images generated from Street View, which were taken at different times several years apart.
- the image set can be divided into three roughly equal parts for training, validation, and testing. Each part contains about 83k database images and 8k query images, and there is no intersection between the three parts.
- Pitts30k is a subset of Pitts250k, which is used by many algorithms because it helps speed up training.
- the image set also consists of three parts, which are used for training, validation and testing. Each part contains 10k database images, and there is no geographic intersection.
- TokyoTM is obtained by collecting Google Street View panoramas and cutting each panorama into 12 images with different perspectives. It also contains photos of the same place taken at different times. Therefore, TokyoTM is suitable for evaluating the robustness of location recognition algorithms against changes in conditions and viewing angles. It contains two parts: a training set and a validation set.
- in step 430, local features of the sample image are extracted.
- step 430 may use VGGNet to extract local features of the sample image.
- VGGNet is a deep CNN structure developed by the computer vision team of Oxford University and Google DeepMind. It repeatedly stacks small 3*3 convolution kernels and 2*2 max pooling layers to build CNN structures up to 19 layers deep.
- VGGNet uses 3*3 convolution kernels and 2*2 pooling kernels throughout, and improves performance by continuously deepening the network structure. Since the parameters are mainly concentrated in the last three fully connected layers, increasing the number of network layers does not cause an explosion in the number of parameters. At the same time, in terms of receptive field, two stacked 3*3 convolutional layers are equivalent to one 5*5 convolutional layer, and three stacked 3*3 convolutional layers are equivalent to one 7*7 convolutional layer.
- the receptive field of three 3*3 convolutional layers is the same size as that of one 7*7 convolutional layer, but the former has only about half as many parameters as the latter, and the former applies 3 nonlinear operations while the latter applies only 1, so the former has a stronger ability to learn features.
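- The receptive-field and parameter comparison can be checked with a few lines of arithmetic. The sketch assumes stride-1 convolutions, equal input and output channel counts, and ignores biases; the channel count of 64 is an arbitrary example value.

```python
def stacked_receptive_field(kernel, layers):
    """Receptive field of `layers` stacked stride-1 convolutions with a square kernel."""
    return 1 + layers * (kernel - 1)

def conv_weight_count(kernel, channels, layers=1):
    """Number of weights (biases ignored) when every layer maps
    `channels` input channels to `channels` output channels."""
    return layers * kernel * kernel * channels * channels

channels = 64                                           # hypothetical channel count
three_3x3 = conv_weight_count(3, channels, layers=3)    # 27 * C^2 weights
one_7x7 = conv_weight_count(7, channels)                # 49 * C^2 weights
ratio = three_3x3 / one_7x7                             # 27 / 49, about 0.55
```

  The ratio is independent of the channel count, which is why the text can state it without fixing a particular layer width.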
- VGGNet also adds linear transformations by using 1*1 convolutional layers, which leave the number of output channels unchanged.
- the 1*1 convolutional layer is often used to refine features, that is, the features of multiple channels are combined to produce an output with more or fewer channels, while the spatial size of each feature map remains unchanged.
- the 1*1 convolutional layer can also be used to replace the fully connected layer.
- VGGNet comes in several variants, ranging in depth from 11 to 19 layers. The more commonly used ones are VGGNet-16 and VGGNet-19. VGGNet divides the network into 5 segments; each segment consists of multiple 3*3 convolutional layers connected in series and is followed by a max pooling layer, and the network ends with 3 fully connected layers and a softmax layer.
- the first part of the CNN model in the embodiment of this application can be implemented based on VGGNet.
- the last layer in the above-mentioned VGGNet basic network may be removed.
- in step 450, local features are aggregated into feature vectors.
- step 450 may use the improved NetVLAD to perform local feature aggregation.
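- NetVLAD uses an approximation to perform soft assignment on the weight parameter $a_k(x_i)$, as shown in the following formula:
  $$\bar{a}_k(x_i) = \frac{e^{-\alpha \lVert x_i - c_k \rVert^2}}{\sum_{k'} e^{-\alpha \lVert x_i - c_{k'} \rVert^2}} \tag{2}$$
  where $\alpha$ is a positive constant that controls how quickly the weight decays with the distance to the cluster center.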
- the above weight assignment can be regarded as a fuzzy clustering assignment, in which a probability weight is generated according to the distance of each local feature to the cluster center.
- for a local feature descriptor $x_i$, its weight under each cluster lies between 0 and 1.
- the highest weight indicates the cluster whose center is closest to the feature, and a low weight means that the feature is far from that cluster center. Note that when $\alpha$ approaches positive infinity ($+\infty$), equation (2) reduces to the original VLAD structure.
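- the final VLAD feature vector can be obtained as:
  $$V(j,k) = \sum_{i} \frac{e^{w_k^T x_i + b_k}}{\sum_{k'} e^{w_{k'}^T x_i + b_{k'}}}\,\bigl(x_i(j) - c_k(j)\bigr)$$
  where the sum runs over the local descriptors $x_i$ of the image, and $w_k$ and $b_k$ are the soft-assignment weight and bias of the k-th cluster.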
- NetVLAD can effectively aggregate the statistics of first-order residuals of different parts (clusters) in the local feature space through the above-mentioned soft allocation method on different clusters.
- NetVLAD contains three parameters, $w_k$, $b_k$, and $c_k$, which makes it more flexible than the traditional VLAD method with only one parameter $c_k$, and all of the parameters can be learned end to end for a specific task.
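- A forward pass of this soft-assignment aggregation might look like the NumPy sketch below. It is not the application's network: it applies a single L2 normalization, has no training code, and the names `soft_assign`, `netvlad_aggregate` and `alpha` are illustrative. The demo assumes the usual correspondence $w_k = 2\alpha c_k$ and $b_k = -\alpha \lVert c_k \rVert^2$, under which a large $\alpha$ approaches hard assignment.

```python
import numpy as np

def soft_assign(x, w, b):
    """Softmax weights of N descriptors (N, D) over K clusters.

    w: (K, D) weights, b: (K,) biases. Returns an (N, K) matrix whose rows sum to 1.
    """
    logits = x @ w.T + b
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)

def netvlad_aggregate(x, w, b, c):
    """Soft-assigned residual sum over clusters, flattened and L2-normalized."""
    a = soft_assign(x, w, b)                        # (N, K)
    v = a.T @ x - a.sum(axis=0)[:, None] * c        # (K, D)
    v = v.reshape(-1)
    return v / (np.linalg.norm(v) + 1e-12)

# Toy data: three 2-D descriptors and two cluster centers.
x_demo = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 5.0]])
c_demo = np.array([[2.0, 0.0], [0.0, 4.0]])
alpha = 50.0                                        # hypothetical sharpness value
w_demo = 2.0 * alpha * c_demo
b_demo = -alpha * (c_demo ** 2).sum(axis=1)
a_demo = soft_assign(x_demo, w_demo, b_demo)
v_demo = netvlad_aggregate(x_demo, w_demo, b_demo, c_demo)
```

  Because every operation here is differentiable, $w_k$, $b_k$ and $c_k$ can in principle be updated by backpropagation, which is the flexibility the text attributes to NetVLAD.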
- in step 470, dimensionality reduction is performed on the feature vector to obtain the corresponding compressed representation vector.
- step 470 may use NetPCA, first proposed by this application and described below, to perform the dimensionality reduction.
- the embodiment of this application proposes for the first time that a neural network is used to simulate the function of PCA, namely NetPCA.
- the core idea of NetPCA is to project an image into an orthogonal feature space so that each element of the image representation is linearly independent, thereby greatly compressing the redundant information in the image representation.
- NetPCA obtains the direction of the projection matrix through end-to-end training.
- NetPCA is set as a fully connected layer in the entire CNN model.
- This layer is used to receive the feature vector input obtained in step 450 and has a preset number of neurons.
- the number of neurons is equal to the compression target dimension L (ie, the second dimension) of the feature vector, so it can be set according to requirements.
- the weight of each neuron is constrained to be a unit vector, and the weights between the neurons satisfy an orthogonal relationship, thereby ensuring that the compressed image features are in the unit orthogonal space.
- NetPCA can project the feature vector obtained in step 450 into the unit orthogonal feature space to obtain a compressed representation vector with the target dimension.
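- The forward pass of such a layer reduces to a single matrix product, as sketched below. In the application the orthonormal weights are learned by end-to-end training; here a QR decomposition of a random matrix stands in for trained weights, and the function names are hypothetical.

```python
import numpy as np

def random_orthonormal_weights(in_dim, out_dim, seed=0):
    """Return an (in_dim, out_dim) matrix with orthonormal columns.

    Each column plays the role of one neuron's weight vector. The QR
    decomposition is only a stand-in for weights obtained by training.
    """
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((in_dim, out_dim)))
    return q

def netpca_forward(feature, weights):
    """Project a first-dimension feature vector onto the L neuron weights."""
    return weights.T @ feature

w_pca = random_orthonormal_weights(16, 4)   # first dimension 16, second dimension 4
compressed = netpca_forward(np.ones(16), w_pca)
```

  Since the columns are orthonormal, the projection never increases the vector's norm, and the output coordinates are expressed in a unit orthogonal basis.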
- Figure 5 shows the basic network architecture of the CNN model corresponding to steps 430-470.
- the image 501 first goes through VGGNet 502 to extract the local features of the image; then, through the local feature aggregation of NetVLAD 503, the K×D-dimensional feature vector representation is obtained; and then, through the dimensionality reduction of the NetPCA fully connected layer 504, the L-dimensional compressed representation vector is finally output.
- the image 501 here is a sample image in the model training process of this embodiment, and is the image to be queried in the subsequent model application process (ie, the location recognition application process).
- in step 490, a CNN model meeting preset conditions is obtained through model parameter training.
- the trainable parameters include the weight matrices of three parts: VGGNet, NetVLAD, and NetPCA.
- the embodiment of the application constructs a reasonable loss function.
- the aforementioned sample image includes a first image, a plurality of second images whose shooting locations are the same as the first image, and a plurality of third images whose shooting locations are different from the first image, and the aforementioned feature vector includes a first feature vector corresponding to the first image, a second feature vector corresponding to the second image, and a third feature vector corresponding to the third image.
- for a specific image q in the sample image, an image whose geographic distance from q is less than the first threshold (which can be regarded as the same shooting location) can be set as a potential positive sample $p_i^q$
- an image whose geographic distance from q is greater than the second threshold (which can be regarded as a different shooting location) is set as a negative sample $n_j^q$
- the training target of the model can be designed such that, for each sample image, a compressed representation vector is output for which the distance between q and its best matching image $p_{i*}^q$ is smaller than the distance between q and every negative sample $n_j^q$
- a triplet ranking loss function can be defined as shown in the following formula:
  $$L = \sum_j l\bigl(d(q, p_{i*}^q) + m - d(q, n_j^q)\bigr) \tag{7}$$
- L is the loss function
- m is the boundary (margin) constant
- $d(\cdot,\cdot)$ is the distance between compressed representation vectors, $p_{i*}^q$ is the potential positive sample closest to q, and $l(x) = \max(x, 0)$
- the loss function is designed as the sum of the individual losses over the negative sample images $n_j^q$. For each negative sample image, if the distance between the specific image q and it is greater than the distance between q and the best matching image $p_{i*}^q$, and the difference exceeds the preset boundary, the loss is zero. Conversely, if the difference does not meet the preset boundary, the loss is proportional to the difference.
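- The ranking loss described above can be sketched as follows. The sketch assumes Euclidean distance between compressed representation vectors (a squared distance would be an equally plausible choice) and uses hypothetical names and toy vectors.

```python
import numpy as np

def ranking_loss(q, positives, negatives, margin):
    """Sum of hinge losses over the negative samples.

    q:         (L,)   compressed representation of the specific image
    positives: (P, L) potential positive samples
    negatives: (Q, L) negative samples
    margin:    the boundary constant m
    """
    d_pos = np.linalg.norm(positives - q, axis=1)
    d_neg = np.linalg.norm(negatives - q, axis=1)
    best_match = d_pos.min()              # distance to the best matching image
    # A negative contributes zero once it is farther than best_match + margin.
    return float(np.maximum(0.0, best_match + margin - d_neg).sum())

q_vec = np.array([0.0, 0.0])
pos = np.array([[1.0, 0.0], [3.0, 0.0]])
neg = np.array([[5.0, 0.0], [1.5, 0.0]])
loss_value = ranking_loss(q_vec, pos, neg, margin=1.0)
```

  In the toy data the first negative already lies beyond the boundary and contributes nothing, while the second violates it by 0.5, which is the total loss.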
- in order to constrain the weights of the neurons in NetPCA to satisfy the orthogonal relationship, an orthogonal constraint term can be further added to the above loss function; the orthogonal constraint term is obtained from the weight matrix of each neuron and a known unit vector.
- the orthogonal constraint term G is as follows:
  $$G = \mathrm{sum}\bigl(g(W^T W - E)\bigr) \tag{8}$$
- W is the weight matrix of the neurons
- T denotes matrix transposition
- E is the known unit vector
- g squares each element of the matrix
- sum denotes summation.
- the optimal projection direction of NetPCA for the feature vector can be determined through end-to-end training.
- the weight W of the neuron is the determined optimal projection direction.
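- The orthogonal constraint term G can be evaluated as in the sketch below. It assumes that W stores one neuron's weight vector per column and that E acts as the identity matrix of matching size, so that G is zero exactly when the weight vectors are orthonormal; the function name is hypothetical.

```python
import numpy as np

def orthogonal_constraint(weights):
    """Penalty on deviation of the neuron weights from orthonormality.

    weights: (first_dim, L) matrix, one column per neuron.
    Squares every element of W^T W - E and sums the result.
    """
    out_dim = weights.shape[1]
    residual = weights.T @ weights - np.eye(out_dim)
    return float((residual ** 2).sum())

orthonormal = np.eye(4)[:, :2]                    # two orthonormal columns
collinear = np.array([[1.0, 1.0], [0.0, 0.0]])    # two identical unit columns
g_zero = orthogonal_constraint(orthonormal)
g_bad = orthogonal_constraint(collinear)
```

  The diagonal of the residual penalizes weight vectors that are not unit length, and the off-diagonal entries penalize pairs that are not orthogonal.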
- step 490 can use a standard stochastic gradient descent (SGD, Stochastic Gradient Descent) algorithm for CNN training.
- an example is shown in Figure 6, including the following steps 610-650.
- in step 610, the loss function is backpropagated through the CNN model to update the model parameters of the CNN model.
- the weight parameter matrices of the convolutional layer and the fully connected layer can be updated during CNN training based on backpropagation of the loss.
- the weight parameter matrices in the first part for local feature extraction, the second part for feature aggregation, and the third part for dimensionality reduction can likewise be updated during CNN training based on backpropagation of the loss.
- the weight parameter matrix of the convolutional layer and the fully connected layer can be initialized with some different small random numbers.
- the convolution kernels of all convolution layers can be initialized according to a Gaussian distribution with 0 as the mean and 0.01 as the variance.
- in step 630, the loss function is recalculated based on the CNN model with updated parameters.
- the above steps 430-470 can be executed again to perform the extraction and aggregation of local features and the dimensionality reduction of the feature vector, and the loss is calculated again based on the constructed loss function.
- in step 650, it is determined whether the preset stopping condition is met; if so, the model parameters of the CNN model are output; otherwise, the process returns to step 610.
- a count threshold can be set to control the number of iterations of training
- a loss threshold can also be set as a preset stop condition
- the convergence threshold of a model parameter can also be set as a preset stop condition.
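- The loop of steps 610-650 with the three kinds of stop condition can be sketched as below. For brevity the sketch uses plain gradient descent on a toy quadratic loss rather than stochastic gradient descent on a CNN; the function `train`, its default thresholds and the toy loss are assumptions made for the example.

```python
import numpy as np

def train(params, loss_and_grad, lr=0.1, max_iters=1000,
          loss_tol=1e-6, param_tol=1e-9):
    """Iterate until one of three preset stop conditions is met.

    Returns (params, iterations, reason), where reason is 'loss',
    'params' or 'iterations'.
    """
    for it in range(1, max_iters + 1):
        loss, grad = loss_and_grad(params)       # recompute the loss (step 630)
        if loss < loss_tol:                      # loss threshold
            return params, it, "loss"
        new_params = params - lr * grad          # parameter update (step 610)
        if np.linalg.norm(new_params - params) < param_tol:
            return new_params, it, "params"      # parameter convergence threshold
        params = new_params
    return params, max_iters, "iterations"       # count threshold

target = np.array([1.0, -2.0])

def toy_loss(p):
    """Quadratic toy loss and its gradient, standing in for the CNN loss."""
    diff = p - target
    return float(diff @ diff), 2.0 * diff

final_params, n_iters, reason = train(np.zeros(2), toy_loss)
```

  Checking the loss threshold before the update means the returned parameters are the ones whose loss was actually evaluated.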
- in this way, a location recognition model trained end to end can be truly realized, and the resulting CNN model can directly obtain low-dimensional image features, thereby improving the performance of location recognition.
- this application proposes a differentiable feature compression layer NetPCA, which is used to compress features in the CNN model.
- the entire CNN model for location recognition can truly realize end-to-end training, and the finally trained CNN model can directly obtain image features with low dimensions, high discrimination and good generalization.
- integrating NetPCA into CNN model training can significantly reduce the computational overhead and greatly reduce the risk of the algorithm falling into over-fitting compared with feature compression as a model post-processing step.
- Figures 7-8 exemplarily show the performance comparison of the location recognition model in the present application and related technologies.
- the curve corresponding to $f_{VLAD}$ in Fig. 7 represents the performance of NetVLAD with 32k dimensions; the curves labeled 512, 1024, 2048, and 4096 represent the performance of the location recognition model of the embodiment of this application (NetVLAD+NetPCA) when the dimension of the compressed representation vector (i.e., the second dimension) is set to 512, 1024, 2048, and 4096, respectively.
- Fig. 9 is a flow chart showing a method for location identification according to an exemplary embodiment. As shown in FIG. 9, the location identification method can be executed by any computing device, and can include the following steps 910-930.
- in step 910, a compressed representation vector is extracted from the acquired image sequence using the trained CNN model.
- the CNN model used in step 910 can be obtained by training with the model training method for location recognition described in any of the foregoing embodiments.
- in step 930, location recognition is performed based on the extracted compressed representation vector.
- step 930 recognizes the spatial location corresponding to the query image.
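- the CNN model performs local feature extraction, aggregation, and dimensionality reduction on the input image (simplified as an image feature extractor in Figure 10), and finally obtains the compressed representation vector f of the image, so that both the image database marked with locations and the image to be queried are projected into the image feature space.
- the similarity between the compressed representation vector of the image to be queried and those of the sample images in the database is then calculated; if the similarity between the image to be queried and the most similar image in the database satisfies a certain threshold, the location of that image in the database is considered to be the location of the image to be queried.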
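- The lookup in step 930 can be sketched as a thresholded nearest-neighbor search. The sketch assumes cosine similarity between compressed representation vectors and a hypothetical threshold of 0.8; the function name and the toy database are illustrative only.

```python
import numpy as np

def recognize_location(query_vec, db_vecs, db_locations, min_similarity=0.8):
    """Return (location, similarity) of the best database match.

    The location is None when the best similarity is below the threshold.
    """
    q = query_vec / np.linalg.norm(query_vec)
    db = db_vecs / np.linalg.norm(db_vecs, axis=1, keepdims=True)
    sims = db @ q                                # cosine similarity per database image
    best = int(np.argmax(sims))
    if sims[best] >= min_similarity:
        return db_locations[best], float(sims[best])
    return None, float(sims[best])

db_vectors = np.array([[1.0, 0.0], [0.0, 1.0]])
db_places = ["A", "B"]                           # stand-ins for marked locations
hit, hit_sim = recognize_location(np.array([0.9, 0.1]), db_vectors, db_places)
miss, miss_sim = recognize_location(np.array([1.0, 1.0]), db_vectors, db_places)
```

  With low-dimensional compressed vectors this search is a single matrix-vector product, which is where the efficiency gain from feature compression shows up.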
- in this way, a location recognition model trained end to end can be truly realized, and the resulting CNN model can directly obtain low-dimensional image features, thereby improving the performance of location recognition.
- Fig. 11 is a block diagram showing a model training device for location recognition according to an exemplary embodiment.
- the model training device includes but is not limited to: a feature extraction module 1110, a feature aggregation module 1120, a feature compression module 1130, and a model training module 1140.
- the feature extraction module 1110 is configured to extract local features of a sample image based on the first part of the convolutional neural network CNN model, the sample image including at least one group of multiple images taken at the same place;
- the feature aggregation module 1120 is configured to aggregate the local features into a feature vector having the first dimension based on the second part of the CNN model;
- the feature compression module 1130 is configured to: obtain a compressed representation vector of the feature vector based on the third part of the CNN model, the compressed representation vector having a second dimension smaller than the first dimension;
- the model training module 1140 is configured to: adjust the model parameters of the first part, the second part, and the third part, with the goal of minimizing the distance between the compressed representation vectors corresponding to the multiple images taken at the same place, until a CNN model that meets the preset conditions is obtained.
- the feature compression module 1130 is configured to project the feature vector to a unit orthogonal space based on the third part to obtain the compressed representation vector.
- the third part is a fully connected layer in the CNN model that receives the feature vector as input; the fully connected layer includes a number of neurons equal to the second dimension, the weight matrix of each neuron is a unit vector having the first dimension, and the weight matrices of the neurons satisfy an orthogonal relationship.
- the model training module 1140 is configured to construct a loss function of the CNN model based on an orthogonal constraint term of the weight matrix, the orthogonal constraint term being obtained from the weight matrix of each neuron and a known unit vector; optionally, the expression of the orthogonal constraint term G is shown in the above equation (8).
- the sample image includes a first image, a plurality of second images whose shooting locations are the same as the first image, and a plurality of third images whose shooting locations are different from the first image.
- the feature vector includes a first feature vector corresponding to the first image, a second feature vector corresponding to the second image, and a third feature vector corresponding to the third image.
- the model training module 1140 is configured to construct a loss function of the CNN model based on a first distance and a second distance, where the first distance is the distance between the first feature vector and the second feature vector, and the second distance is the distance between the first feature vector and the third feature vector; and to backpropagate the loss function through the CNN model to update the model parameters until the CNN model meets the preset convergence conditions.
- the loss function constructed by the model training module 1140 is as shown in the above formula (7).
- the feature extraction module 1110 is configured to extract local features of the sample image using the VGGNet structure.
- the feature aggregation module 1120 is configured to use the NetVLAD structure to aggregate the local features into the feature vector.
- in this way, a location recognition model trained end to end can be truly realized, and the obtained CNN model can directly obtain low-dimensional image features, thereby improving the performance of location recognition.
- Fig. 12 is a block diagram showing a location identification device according to an exemplary embodiment.
- the location identification device includes but is not limited to: an extraction module 1210 and an identification module 1220.
- the extraction module 1210 is configured to use the trained CNN model to extract a compressed representation vector from the collected image.
- the CNN model used in the extraction module 1210 can be obtained by training the model training device for location recognition described in any of the above embodiments.
- the recognition module 1220 is configured to perform location recognition based on the compressed representation vector extracted by the extraction module 1210.
- the extraction module 1210 uses the trained CNN model to perform local feature extraction, aggregation, and dimensionality reduction on the input image, finally obtaining the compressed representation vector f of the image, so that both the image database marked with locations and the image to be queried are projected into the image feature space.
- the recognition module 1220 calculates the similarity between the compressed representation vector of the image to be queried and those of the sample images in the database; if the similarity between the image to be queried and the most similar image in the database meets a certain threshold, the location of that image in the database is considered to be the location of the image to be queried.
- in this way, a location recognition model trained end to end can be truly realized, and the resulting CNN model can directly obtain low-dimensional image features, thereby improving the performance of location recognition.
- although several modules or units of the device for action execution are mentioned in the above detailed description, this division is not mandatory.
- the features and functions of two or more modules or units described above may be embodied in one module or unit.
- the features and functions of a module or unit described above can be further divided into multiple modules or units to be embodied.
- a component displayed as a module or unit may or may not be a physical unit, that is, it may be located in one place, or may be distributed on multiple network units. Some or all of the modules can be selected according to actual needs to achieve the objectives of the solutions of the present disclosure.
- the exemplary embodiments described herein can be implemented by software, or can be implemented by combining software with necessary hardware. Therefore, the technical solution according to the embodiments of the present application can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (can be a CD-ROM, U disk, mobile hard disk, etc.) or on the network , Including several instructions to make a computing device (which can be a personal computer, a server, a touch terminal, or a network device, etc.) execute the method according to the embodiment of the present application.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Software Systems (AREA)
- General Engineering & Computer Science (AREA)
- Computing Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Molecular Biology (AREA)
- Biomedical Technology (AREA)
- Databases & Information Systems (AREA)
- Mathematical Physics (AREA)
- Multimedia (AREA)
- Computational Linguistics (AREA)
- Biophysics (AREA)
- Medical Informatics (AREA)
- Library & Information Science (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Biodiversity & Conservation Biology (AREA)
- Image Analysis (AREA)
Abstract
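A method and apparatus for place recognition and for training a place recognition model, a computer-readable storage medium, and an electronic device. The method includes: extracting local features of sample images based on a first part of a CNN model (310); aggregating the local features into a feature vector having a first dimensionality based on a second part of the CNN model (330); obtaining a compressed representation vector of the feature vector based on a third part of the CNN model, the compressed representation vector having a second dimensionality smaller than the first dimensionality (350); and adjusting the model parameters of the first to third parts with the goal of minimizing the distance between the compressed representation vectors corresponding to multiple images captured at the same place (370). By introducing a compression process with trainable parameters into the CNN model, the method enables truly end-to-end training of the place recognition model, and the resulting CNN model directly produces low-dimensional image features, thereby improving place recognition performance by means of artificial intelligence.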
Description
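This application claims priority to Chinese Patent Application No. 201910390693.2, filed on May 10, 2019 and entitled "Place recognition method and apparatus, model training method and apparatus for place recognition, and electronic device", which is incorporated herein by reference in its entirety.
This application relates to the field of computer technology, and in particular to a method and apparatus for place recognition and for training a place recognition model, a computer-readable storage medium, and an electronic device.
With advances in image processing technology, place recognition has found increasingly wide application. For example, in map applications, place recognition can identify identical locations and thereby correct place and position errors that may arise while a map is being built. In video applications, place recognition can classify image segments, so that a video can be summarized and segmented to extract its highlights. Place recognition can also support the augmented reality (AR) functions of various mobile applications: when a user photographs a scene with a mobile device, place recognition can determine the name of the scene and then trigger the corresponding introduction or AR browsing function.
Place recognition faces three main challenges: condition changes, viewpoint changes, and efficiency requirements. To cope with these difficulties, the industry has developed three classes of approaches.
The first class extracts features from place images using hand-designed descriptors. This approach is fairly robust to viewpoint changes but cannot adapt automatically to changes in the application scenario.
The second class uses a pre-trained convolutional neural network (CNN) as the feature extractor for place images. Compared with the first class, it copes better with condition changes, but because the CNN model was pre-trained in another domain, the performance gain is limited.
The third class takes place recognition directly as the training objective: a common network first extracts descriptors of the place image, which are then aggregated into a feature vector of a specific dimensionality. Such algorithms markedly improve robustness to condition and viewpoint changes, but because the resulting image features usually have high dimensionality, the computational cost is high and the efficiency requirements of place recognition are often hard to meet.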
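Summary
This application provides a method and apparatus for place recognition and for training a place recognition model, a computer-readable storage medium, and an electronic device.
According to an embodiment of this application, a model training method for place recognition is provided. The method includes: extracting local features of sample images based on a first part of a CNN model, the sample images including at least one group of multiple images captured at the same place; aggregating the local features into a feature vector having a first dimensionality based on a second part of the CNN model; obtaining a compressed representation vector of the feature vector based on a third part of the CNN model, the compressed representation vector having a second dimensionality smaller than the first dimensionality; and adjusting the model parameters of the first part, the second part, and the third part, with the goal of minimizing the distance between the compressed representation vectors corresponding to the multiple images captured at the same place, until a CNN model satisfying a preset condition is obtained.
According to an embodiment of this application, a place recognition method is provided, including: extracting a compressed representation vector from a captured image using a CNN model, the CNN model being trained by the model training method for place recognition described above; and performing place recognition based on the extracted compressed representation vector.
According to an embodiment of this application, a model training apparatus for place recognition is provided, including: a feature extraction module configured to extract local features of sample images based on a first part of a CNN model, the sample images including at least one group of multiple images captured at the same place; a feature aggregation module configured to aggregate the local features into a feature vector having a first dimensionality based on a second part of the CNN model; a feature compression module configured to obtain a compressed representation vector of the feature vector based on a third part of the CNN model, the compressed representation vector having a second dimensionality smaller than the first dimensionality; and a model training module configured to adjust the model parameters of the first part, the second part, and the third part, with the goal of minimizing the distance between the compressed representation vectors corresponding to the multiple images captured at the same place, until a CNN model satisfying a preset condition is obtained.
According to an embodiment of this application, a place recognition apparatus is provided, including: an extraction module configured to extract a compressed representation vector from a captured image using a CNN model, the CNN model being trained by the model training method for place recognition described above; and a recognition module configured to perform place recognition based on the extracted compressed representation vector.
According to an embodiment of this application, a computer-readable storage medium is provided, storing a computer program that, when executed by a processor, implements the model training method for place recognition described above or the place recognition method described above.
According to an embodiment of this application, an electronic device is provided, including: a processor; and a memory storing computer-readable instructions that, when executed by the processor, implement the model training method for place recognition described above or the place recognition method described above.
The technical solutions provided by the embodiments of this application may have the following beneficial effects:
With the model training and place recognition solutions provided by the embodiments of this application, introducing a compression process with trainable parameters into the CNN model enables truly end-to-end training of the place recognition model, and the resulting CNN model directly produces low-dimensional image features, thereby improving place recognition performance.
It should be understood that the foregoing general description and the following detailed description are merely exemplary and do not limit this application.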
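The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with this application and, together with the specification, serve to explain its principles.
FIG. 1 is a schematic diagram of an exemplary system architecture to which the model training method or apparatus, or the place recognition method or apparatus, of the embodiments of this application can be applied.
FIG. 2 is a schematic structural diagram of a computer system suitable for implementing the electronic device of the embodiments of this application.
FIG. 3 is a flowchart of a model training method for place recognition according to an exemplary embodiment.
FIG. 4 is a flowchart of a model training method for place recognition according to another exemplary embodiment.
FIG. 5 is a schematic diagram of the base network structure of the embodiment shown in FIG. 4.
FIG. 6 is a schematic flowchart of step 490 in the embodiment shown in FIG. 4.
FIGS. 7 and 8 illustrate a performance comparison between the place recognition model of the embodiments of this application and those of the related art.
FIG. 9 is a flowchart of a place recognition method according to an exemplary embodiment.
FIG. 10 is a schematic implementation scenario of step 930 in the embodiment shown in FIG. 9.
FIG. 11 is a block diagram of a model training apparatus for place recognition according to an exemplary embodiment.
FIG. 12 is a block diagram of a place recognition apparatus according to an exemplary embodiment.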
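Example embodiments will now be described more fully with reference to the accompanying drawings. The example embodiments can, however, be implemented in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this application will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of the embodiments of this application. Those skilled in the art will appreciate, however, that the technical solutions of this application may be practiced without one or more of the specific details, or with other methods, components, apparatuses, steps, and so on. In other instances, well-known methods, apparatuses, implementations, or operations are not shown or described in detail, to avoid obscuring aspects of this application.
The block diagrams shown in the drawings are merely functional entities and do not necessarily correspond to physically separate entities. That is, these functional entities may be implemented in software, in one or more hardware modules or integrated circuits, or in different network and/or processor apparatuses and/or microcontroller apparatuses.
The flowcharts shown in the drawings are merely illustrative; they need not include all contents and operations/steps, nor must they be executed in the order described. For example, some operations/steps may be decomposed, and some may be merged or partially merged, so the actual execution order may change according to the actual situation.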
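FIG. 1 is a schematic diagram of an exemplary system architecture 100 to which the model training method or apparatus for place recognition, or the place recognition method or apparatus, of the embodiments of this application can be applied.
As shown in FIG. 1, the system architecture 100 may include one or more of terminal devices 101, 102, and 103, a network 104, and a server 105. The network 104 is the medium that provides communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired or wireless communication links or fiber-optic cables.
It should be understood that the numbers of terminal devices, networks, and servers in FIG. 1 are merely illustrative. There may be any number of terminal devices, networks, and servers, as required by the implementation. For example, the server 105 may be a server cluster composed of multiple servers.
A user may use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104, for example to receive or send messages. The terminal devices 101, 102, 103 may be various electronic devices with display screens, including but not limited to smartphones, tablet computers, portable computers, and desktop computers. The server 105 may be a server that provides various services.
For example, a user uploads a sample image sequence to the server 105 using the terminal device 103 (or the terminal device 101 or 102), the sample image sequence including at least one group of multiple images captured at the same place. Based on this sample image sequence, the server 105 may: extract local features of the sample images based on the first part of the CNN model; aggregate the local features into a feature vector having a first dimensionality based on the second part of the CNN model; obtain a compressed representation vector of the feature vector based on the third part of the CNN model, the compressed representation vector having a second dimensionality smaller than the first dimensionality; and adjust the model parameters of the first part, the second part, and the third part, with the goal of minimizing the distance between the compressed representation vectors corresponding to the multiple images, until a CNN model satisfying a preset condition is obtained.
As another example, a user captures an image at some place using the terminal device 101 (or the terminal device 102 or 103) and uploads it to the server 105; the server 105 extracts a compressed representation vector from the image using the trained CNN model and performs place recognition based on the extracted compressed representation vector.
In some embodiments, the model training method for place recognition or the place recognition method provided by the embodiments of this application is generally executed by the server 105, and accordingly the model training apparatus for place recognition or the place recognition apparatus is generally located in the server 105. In other embodiments, some terminals may have functions similar to those of the server and thus execute the method. The model training method for place recognition and the place recognition method provided by the embodiments of this application are therefore not limited to execution on the server side.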
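FIG. 2 is a schematic structural diagram of a computer system suitable for implementing the electronic device of the embodiments of this application.
It should be noted that the computer system 200 of the electronic device shown in FIG. 2 is only an example and should not impose any limitation on the functions or scope of use of the embodiments of this application.
As shown in FIG. 2, the computer system 200 includes a central processing unit (CPU) 201, which can perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 202 or a program loaded from a storage section 208 into a random access memory (RAM) 203. The RAM 203 also stores various programs and data required for system operation. The CPU 201, the ROM 202, and the RAM 203 are connected to one another through a bus 204. An input/output (I/O) interface 205 is also connected to the bus 204.
The following components are connected to the I/O interface 205: an input section 206 including a keyboard, a mouse, and the like; an output section 207 including a cathode ray tube (CRT) or liquid crystal display (LCD), a speaker, and the like; a storage section 208 including a hard disk and the like; and a communication section 209 including a network interface card such as a LAN card or a modem. The communication section 209 performs communication processing via a network such as the Internet. A drive 210 is also connected to the I/O interface 205 as needed. A removable medium 211, such as a magnetic disk, an optical disc, a magneto-optical disc, or a semiconductor memory, is mounted on the drive 210 as needed, so that a computer program read from it can be installed into the storage section 208 as needed.
In particular, according to the embodiments of this application, the processes described below with reference to the flowcharts may be implemented as computer software programs. For example, the embodiments of this application include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for executing the methods shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 209, and/or installed from the removable medium 211. When the computer program is executed by the central processing unit (CPU) 201, the various functions defined in the system of this application are performed.
It should be noted that the computer-readable medium described in this application may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. A computer-readable storage medium may be, for example but without limitation, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of computer-readable storage media include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination thereof. In this application, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in connection with an instruction execution system, apparatus, or device. A computer-readable signal medium, in turn, may include a data signal propagated in baseband or as part of a carrier wave and carrying computer-readable program code. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination thereof. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. Program code contained on a computer-readable medium may be transmitted over any appropriate medium, including but not limited to wireless, wire, optical cable, RF, or any suitable combination thereof.
The flowcharts and block diagrams in the drawings illustrate the possible architectures, functions, and operations of systems, methods, and computer program products according to various embodiments of this application. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of code that contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. It should further be noted that each block in the block diagrams or flowcharts, and combinations of such blocks, may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units described in the embodiments of this application may be implemented in software or in hardware, and the described units may also be provided in a processor. The names of these units do not, in some cases, constitute a limitation on the units themselves.
As another aspect, this application further provides a computer-readable medium, which may be included in the electronic device described in the above embodiments, or may exist separately without being assembled into the electronic device. The computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to implement the methods described in the following embodiments. For example, the electronic device may implement the steps shown in FIGS. 3 to 6.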
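The solutions shown in the embodiments of this application can perform accurate image feature extraction by means of artificial intelligence (AI). Before the technical solutions of the embodiments are described in detail, some related technical solutions, terms, and principles are introduced below.
Artificial intelligence (AI)
Artificial intelligence is the theory, methods, technology, and application systems that use digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can respond in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that machines can perceive, reason, and make decisions.
Artificial intelligence is a comprehensive discipline that covers a wide range of fields and includes both hardware-level and software-level technologies. Basic AI technologies generally include sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, operating/interaction systems, and mechatronics. AI software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Computer vision (CV)
Computer vision is the science of how to make machines "see". More specifically, it refers to using cameras and computers in place of human eyes to recognize, track, and measure targets, and to further process the resulting images so that they become more suitable for human observation or for transmission to instruments for inspection. As a scientific discipline, computer vision studies the related theories and technologies and attempts to build AI systems that can obtain information from images or multidimensional data. Computer vision technologies typically include image processing, image recognition, image semantic understanding, image retrieval, optical character recognition (OCR), video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, and simultaneous localization and mapping, as well as common biometric technologies such as face recognition and fingerprint recognition.
Machine learning (ML)
Machine learning is an interdisciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It studies how computers can simulate or implement human learning behavior in order to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve their own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; it is applied throughout all fields of artificial intelligence. Machine learning and deep learning typically include artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration.
Convolutional neural network (CNN)
A CNN is a multi-layer supervised-learning neural network that is commonly used for image-related machine learning problems.
A typical CNN consists of convolutional layers, pooling layers, and fully connected layers. The lower hidden layers generally alternate between convolutional layers and pooling layers. A convolutional layer uses the convolution operation to enhance the original signal features of the image and reduce noise; a pooling layer exploits the local correlation of images to reduce the amount of computation while preserving rotational invariance of the image. The fully connected layers sit at the top of the CNN; their input is the feature maps produced by the convolutional and pooling layers, and their output can be connected to a classifier that classifies the input image using logistic regression, Softmax regression, or a support vector machine (SVM).
CNN training generally minimizes a loss function by gradient descent: through a loss layer connected after the fully connected layers, the weight parameters of each layer are adjusted by layer-by-layer backpropagation, and the accuracy of the network is improved through repeated training iterations. The training set of a CNN usually consists of vector pairs of the form "input vector, ideal output vector"; before training starts, the weight parameters of all layers may be initialized with different small random numbers. Because a CNN can essentially be regarded as a mapping from input to output that can learn a large number of input-output relationships without any precise mathematical expression relating them, it can be trained on a training set of known vector pairs so that it acquires the ability to map between input-output pairs.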
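Place recognition
In terms of application scenarios, place recognition is commonly used for loop closure detection in simultaneous localization and mapping (SLAM) and for image-based localization.
In visual SLAM, pose estimation is usually a recursive process: the pose of the current frame is computed from the pose of the previous frame, and passing estimates along frame by frame inevitably accumulates error. The key to loop closure detection is to detect effectively that the camera has passed through the same place, which bears on the correctness of the estimated trajectory and map over long periods. Because loop closure detection relates the current data to all historical data, it can greatly reduce the accumulated error produced by the SLAM front end and build a geometrically consistent map. In loop closure detection, place recognition serves to identify whether the camera has returned to the same place. Since loop closure detection corrects the accumulated error of the visual SLAM front end, it can be used in AR-related applications to correct inaccurate poses and lost localization caused by long-running visual odometry.
Image-based localization obtains the geographic location corresponding to an image, and it also has broad application scenarios. For example, a picture taken by a terminal can be uploaded to an image database or search engine annotated with geographic locations, and place recognition can then yield a high-precision geographic location for the photographer. Image-based localization is useful, for example, where the GPS signal is weak or the terrain is complex; mobile phone positioning inevitably deviates in such cases, so the user can take a photo of the current location with the phone and obtain an accurate position through place recognition.
In terms of technical implementation, the purpose of place recognition is to identify the spatial location corresponding to a query image. Given an image database annotated with locations and an image to be queried, place recognition projects all of these images into a feature space by means of an image feature extractor, and then computes the similarity between the image features of the query image and those of the sample images in the database. If the similarity between the query image and the most similar image in the database meets a certain threshold, the location of that database image is taken as the location of the query image. The most critical part of place recognition is therefore obtaining an appropriate image feature extractor.
Building an image feature extractor is usually modeled as an instance retrieval problem with three main steps. First, local descriptors of the image are extracted; then, the local descriptors are aggregated into a feature vector of fixed dimensionality; finally, the feature vector is compressed to a suitable dimensionality.
However, as noted in the background discussion, the training-based place recognition algorithms in the related art train only the first two steps, and the final feature compression step is merely a post-processing step applied after model training is complete. As a result, the image features output by the model have very high dimensionality, which causes two problems. First, the model is prone to the curse of dimensionality, which leads to overfitting, reduces the discriminative power of the Euclidean distance, and degrades model performance. Second, computing image similarity directly on high-dimensional features is too expensive, and applying a compression algorithm as post-processing after the high-dimensional features are obtained also tends to take a long time, so the efficiency requirements of place recognition are not met.
To solve these problems, the embodiments of this application provide a method and apparatus for place recognition and for training a place recognition model, a computer-readable storage medium, and an electronic device.
The principles and implementation details of the technical solutions of the embodiments of this application are described in detail below.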
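FIG. 3 is a flowchart of a model training method for place recognition according to an exemplary embodiment. As shown in FIG. 3, the model training method may be executed by any computing device and may include the following steps 310-370.
In step 310, local features of the sample images are extracted based on the first part of the CNN model.
The sample images here include at least one group of multiple images captured at the same place. In one embodiment, the sample images include multiple groups of images captured at different places, and each group includes multiple images captured at the same place.
As described above, the purpose of place recognition is to identify the spatial location corresponding to a query image. The sample images used to train the model may therefore carry annotated geographic location information, such as GPS information.
For example, for one image among the sample images, the sample images captured at the same place may be labeled as positive samples, and those captured at different places may be labeled as negative samples. Training the place recognition model means continually adjusting the model parameters so that, for each sample image, the final model minimizes the distance between its vector representation and those of its positive samples, while the distance to the vector representations of its negative samples satisfies a preset margin.
It should be noted that "same" and "different" capture places in the embodiments of this application are used only for convenience of description and do not mean that the location information of the images is exactly identical. In one embodiment, the same capture place means that the difference between the geographic location information (for example, GPS information) of two images is less than a first preset value, and a different capture place means that this difference is greater than a second preset value.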
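The labeling rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `haversine_m` and `split_samples`, the use of great-circle distance as the "difference" between GPS positions, and the 10 m and 25 m values for the first and second preset values are all hypothetical and are not specified in this application.

```python
import math

def haversine_m(a, b):
    """Great-circle distance in metres between two (lat, lon) pairs given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000.0 * math.asin(math.sqrt(h))

def split_samples(query_pos, others, same_place_m=10.0, different_place_m=25.0):
    """Label images as positives or negatives for one query image.

    same_place_m plays the role of the first preset value and
    different_place_m the second preset value (both assumed here).
    Images that fall between the two thresholds are left unlabeled.
    """
    positives, negatives = [], []
    for name, pos in others.items():
        d = haversine_m(query_pos, pos)
        if d < same_place_m:
            positives.append(name)
        elif d > different_place_m:
            negatives.append(name)
    return positives, negatives

# Hypothetical GPS tags: "a" is about 1 m away, "b" about 22 m, "c" about 1.1 km.
images = {"a": (40.00001, -80.0), "b": (40.0002, -80.0), "c": (40.01, -80.0)}
print(split_samples((40.0, -80.0), images))
```

Using two separate thresholds leaves a gap in which an image is neither a positive nor a negative, which avoids training on ambiguous pairs.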
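In a general sense, feature extraction here is a primary operation in image processing. In other words, feature extraction is the first operation performed on an image, and it is mainly used to determine whether each pixel represents a feature. As a preliminary operation, the input sample image may also be smoothed in scale space with a Gaussian blur kernel, after which one or more features of the image are computed through local derivative operations.
Local features are, broadly speaking, places in an image that differ from their surroundings. A local feature usually describes a specific region of the image and makes the image highly distinguishable. The feature extraction described above is therefore essentially the extraction of local image features, and its result directly determines the performance of subsequent image classification and recognition.
In the field of image processing, computer vision research long concentrated on hand-crafted image feature extractors such as the scale-invariant feature transform (SIFT) and the histogram of oriented gradients (HOG). The first class of place recognition approaches mentioned in the background discussion, for example, extracts image features using hand-designed descriptors.
As research on deep learning has progressed, image processing has increasingly adopted automatic feature extraction as a base layer, giving rise to many feature extraction network models such as AlexNet and VGGNet. These models have gradually replaced hand-crafted image feature extractors and learn and extract image features automatically. The second and third classes of place recognition approaches mentioned in the background discussion, for example, use trainable CNN models to learn and extract image features automatically.
In one embodiment, step 310 uses a part of the trainable CNN model to extract features of the sample images and obtain their local descriptors.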
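Continuing with FIG. 3, in step 330, the local features are aggregated into a feature vector having a first dimensionality based on the second part of the CNN model.
In practical place recognition, the image to be queried must be compared against the features of a massive number of database images. Although the local descriptors obtained in step 310 are already a feature representation of the sample image, even if each descriptor takes only a few bits, given the number of descriptors per image and the number of images in the database, place recognition performed directly on local descriptors can hardly be completed in a sufficiently short time. The embodiments of this application therefore aggregate the local descriptors in step 330, with the aim of aggregating them into a vector of a specific dimensionality.
Aggregation algorithms used in the related art mainly include bag-of-words (BoW), Fisher vector (FV), and vector of locally aggregated descriptors (VLAD). The core idea of BoW is to extract keypoint descriptors, train a codebook by clustering, and then represent each image by the number of times each center vector of the codebook occurs among the descriptor vectors of that image. The core idea of FV is to use a Gaussian mixture model and represent each image by computing parameters of that model such as the means and covariances. VLAD is a descriptor pooling method widely used in instance retrieval and image classification; it captures statistics of how the local features of an image aggregate, and, unlike BoW, which records occurrence counts of vectors, VLAD records the sum of residuals of the descriptor vectors. The general process of local feature aggregation is described below using VLAD as an example.
Given N D-dimensional local image descriptors $x_i$ as input and K cluster centers $c_k$ as the parameters of VLAD, the output of VLAD is a K×D-dimensional image description vector. For convenience, it is written as a K×D matrix V, which is converted into a vector and then normalized. It is computed as follows:
$$V(j,k)=\sum_{i=1}^{N} a_k(x_i)\,\bigl(x_i(j)-c_k(j)\bigr) \tag{1}$$
where $x_i(j)$ and $c_k(j)$ denote the j-th feature value of the i-th local descriptor and of the k-th cluster center, respectively. $a_k(x_i)$ can be understood as the weight with which the i-th local feature belongs to the k-th cluster; in other words, $a_k(x_i)=1$ indicates that the feature belongs to that cluster, and $a_k(x_i)=0$ indicates that it does not. Intuitively, $V(j,k)$ is the sum of the residuals $(x_i-c_k)$ of all local features over each cluster.
In conventional VLAD, $a_k(x_i)$ can only take the value 1 or 0, which is discontinuous, so it cannot be trained directly by backpropagation in a CNN model. Other aggregation algorithms such as BoW and FV have similar problems.
For this reason, in one embodiment, step 330 uses a part of the trainable CNN model to perform local feature aggregation, aggregating the local features obtained in step 310 into a feature vector having the first dimensionality.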
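As a rough illustration of conventional VLAD with hard assignment, the residual sum of equation (1) can be written as below. This is a minimal sketch, not the implementation of this application: the function names `vlad` and `flatten_and_normalize` are hypothetical, the matrix is stored with one row per cluster, and L2 normalization of the flattened vector is an assumed choice of normalization.

```python
def vlad(descriptors, centers):
    """Hard-assignment VLAD: for each cluster, sum the residuals x_i - c_k.

    Returns a K x D list of lists whose row k holds the residual sum for cluster k.
    """
    K, D = len(centers), len(centers[0])
    V = [[0.0] * D for _ in range(K)]
    for x in descriptors:
        # a_k(x) is 1 for the nearest cluster center and 0 for all others.
        k = min(range(K),
                key=lambda c: sum((x[j] - centers[c][j]) ** 2 for j in range(D)))
        for j in range(D):
            V[k][j] += x[j] - centers[k][j]
    return V

def flatten_and_normalize(V):
    """Convert the K x D matrix into one vector and L2-normalize it."""
    flat = [v for row in V for v in row]
    norm = sum(v * v for v in flat) ** 0.5
    return [v / norm for v in flat] if norm > 0 else flat

# Toy data: two 2-D cluster centers and four local descriptors.
centers = [[0.0, 0.0], [10.0, 10.0]]
descriptors = [[1.0, 0.0], [0.0, 2.0], [9.0, 10.0], [11.0, 12.0]]
V = vlad(descriptors, centers)
print(V, flatten_and_normalize(V))
```

The `min(...)` step is the non-differentiable hard assignment that prevents this form from being trained by backpropagation.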
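Continuing with FIG. 3, in step 350, a compressed representation vector of the feature vector is obtained based on the third part of the CNN model.
The compressed representation vector here has a second dimensionality smaller than the first dimensionality.
The feature vector obtained after the local feature aggregation of step 330 usually still has high dimensionality, which makes it hard to meet the efficiency requirements of place recognition and makes the model prone to the curse of dimensionality, reducing the generalization performance of the image features.
For this reason, the embodiments of this application reduce the dimensionality of the feature vector in step 350.
In short, dimensionality reduction is a preprocessing method for high-dimensional feature data whose purpose is to remove noise and unimportant features from high-dimensional data and retain the most important ones, thereby speeding up data processing. In image processing scenarios, dimensionality reduction can save a great deal of processing time while keeping information loss within a certain range. Related dimensionality reduction algorithms include singular value decomposition (SVD), principal component analysis (PCA), factor analysis (FA), and independent component analysis (ICA). The general process of dimensionality reduction is described below using PCA as an example.
Suppose step 330 uses a trainable VLAD network for local feature aggregation. As described above, for each image it outputs an image description vector of K×D dimensions (the first dimensionality). Given a sample image set of M images, the feature representation matrix of the whole image set is $X\in\mathbb{R}^{(K\times D)\times M}$.
Next, suppose the goal of PCA is to obtain compressed representation vectors of dimensionality L (the second dimensionality). First, the matrix X' is obtained from X by subtracting the mean. Then, the eigenvalues and orthonormal eigenvectors of the covariance matrix of X' are computed; the unit eigenvectors corresponding to the L largest eigenvalues are the first L principal components of X, and the matrix they form can be written as $T\in\mathbb{R}^{(K\times D)\times L}$. Finally, multiplying the transpose $T^{T}$ by the matrix X gives the compressed representation of the whole image set, $Y\in\mathbb{R}^{L\times M}$.
The conventional PCA algorithm described above is not a differentiable process, so it cannot be trained directly by backpropagation in a CNN model. Other dimensionality reduction algorithms such as SVD, FA, and ICA have similar problems.
For this reason, in one embodiment, step 350 uses a part of the trainable CNN model to perform dimensionality reduction, reducing the feature vector obtained in step 330 to a compressed representation vector of smaller dimensionality.
Continuing with FIG. 3, in step 370, the model parameters of the first part, the second part, and the third part of the CNN model are adjusted, with the goal of minimizing the distance between the compressed representation vectors corresponding to the multiple images captured at the same place, until a CNN model satisfying a preset condition is obtained.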
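In one embodiment, the CNN model here may use a typical network structure, and the first part, the second part, and the third part may each include one or more of a convolutional layer, a pooling layer, a fully connected layer, and a Softmax layer. For example, the first part may include convolutional and pooling layers, the second part may include a Softmax layer, and the third part may include a fully connected layer.
As one example of model training, after the parameters of the convolutional and fully connected layers are given random initial values, the sample images of step 310 are input to the model, and the corresponding compressed representation vectors are extracted through steps 310, 330, and 350. The joint loss computed on the basis of the Softmax layer is backpropagated through the model to update the parameters of the convolutional and fully connected layers, the sample images are input again to the updated model, and this is iterated until a preset convergence condition is satisfied, yielding the trained CNN model.
With the model training method provided by the embodiments of this application, introducing a compression process with trainable parameters into the CNN model enables truly end-to-end training of the place recognition model, and the resulting CNN model directly produces low-dimensional image features, thereby improving place recognition performance.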
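FIG. 4 is a flowchart of a model training method for place recognition according to another exemplary embodiment. As shown in FIG. 4, the model training method may be executed by any computing device and may include the following steps 410-490.
In step 410, a sample image set is constructed.
In one embodiment, step 410 may construct the sample image set from public image databases, including but not limited to Pitts250k, Pitts30k, and TokyoTM.
Pitts250k contains 250k database images collected from Google Street View and 24k query images generated from Street View, taken at different times several years apart. The image set can be divided into three roughly equal parts for training, validation, and testing; each part contains about 83k database images and 8k query images, and the three parts do not overlap.
Pitts30k is a subset of Pitts250k and is used by many algorithms because it helps speed up training. It also consists of three parts, for training, validation, and testing. Each part contains 10k database images, and the parts do not overlap geographically.
TokyoTM is obtained by collecting Google Street View panoramas and cutting each panorama into 12 images with different viewpoints; it also contains photos taken at the same place at different times. TokyoTM is therefore suitable for evaluating the robustness of place recognition algorithms to condition and viewpoint changes. It contains a training set and a validation set.
Continuing with FIG. 4, in step 430, local features of the sample images are extracted.
In one embodiment, step 430 may use VGGNet to extract the local features of the sample images.
VGGNet is a deep CNN architecture developed jointly by the Visual Geometry Group at the University of Oxford and researchers at Google DeepMind. By repeatedly stacking small 3*3 convolution kernels and 2*2 max-pooling layers, it builds CNN structures up to 19 layers deep.
VGGNet uses 3*3 convolution kernels and 2*2 pooling kernels throughout and improves performance by steadily deepening the network. Because the parameters are concentrated mainly in the last three fully connected layers, increasing the number of layers does not cause an explosion in the number of parameters. Moreover, two stacked 3*3 convolutional layers are equivalent to one 5*5 convolutional layer, and three stacked 3*3 convolutional layers are equivalent to one 7*7 convolutional layer. In other words, three 3*3 convolutional layers have the same receptive field as one 7*7 convolutional layer, but the former has only about half as many parameters as the latter, and it applies three nonlinear operations where the latter applies only one, which gives the former a stronger ability to learn features.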
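The receptive-field and parameter-count comparison above can be checked with a short calculation. The sketch below is only illustrative; it assumes stride-1 convolutions with the same number of input and output channels C and ignores bias terms, and the helper names `conv_weights` and `stacked_receptive_field` are hypothetical.

```python
def conv_weights(kernel, channels, layers=1):
    """Weight count of `layers` stacked kernel x kernel convolutions with
    `channels` input and output channels each (biases ignored)."""
    return layers * kernel * kernel * channels * channels

def stacked_receptive_field(kernel, layers):
    """Receptive field of `layers` stacked stride-1 convolutions."""
    return 1 + layers * (kernel - 1)

C = 64
three_3x3 = conv_weights(3, C, layers=3)  # 27 * C^2 weights
one_7x7 = conv_weights(7, C)              # 49 * C^2 weights

print(stacked_receptive_field(3, 2))      # two 3*3 layers cover a 5*5 window
print(stacked_receptive_field(3, 3))      # three 3*3 layers cover a 7*7 window
print(three_3x3 / one_7x7)                # ratio 27/49, roughly one half
```

Under these assumptions the ratio is 27/49 regardless of C, which is the "about half" figure cited above.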
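In addition, VGGNet uses 1*1 convolutional layers to add linear transformations without changing the number of output channels. A 1*1 convolutional layer is also often used to distill features, that is, to combine multi-channel features into an output with more or fewer channels while the size of each feature map stays the same. In some derived network structures, 1*1 convolutional layers can also replace fully connected layers.
VGGNet comes in several configurations with depths ranging from 11 to 19 layers; the most commonly used are VGGNet-16 and VGGNet-19. VGGNet divides the network into five stages, each consisting of several 3*3 convolutional layers in series followed by a max-pooling layer, with three fully connected layers and a softmax layer at the end.
In other words, the first part of the CNN model in the embodiments of this application may be implemented on the basis of VGGNet. In one embodiment, so that the subsequent second and third parts of the CNN model can be attached, the last layer of the VGGNet base network described above may be removed.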
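Continuing with FIG. 4, in step 450, the local features are aggregated into a feature vector.
As described in the preceding embodiment, because the weight parameter $a_k(x_i)$ takes discontinuous values, conventional VLAD cannot be plugged directly into a CNN model for training. Therefore, in one embodiment, step 450 may use the improved NetVLAD to perform local feature aggregation.
Optionally, NetVLAD uses an approximation, a soft assignment of the weight parameter $a_k(x_i)$, as shown below:
$$a_k(x_i)=\frac{e^{-\alpha\lVert x_i-c_k\rVert^{2}}}{\sum_{k'}e^{-\alpha\lVert x_i-c_{k'}\rVert^{2}}} \tag{2}$$
This weight assignment can be regarded as a fuzzy clustering assignment, in which a probability-like weight is generated from the distance of each local feature to the cluster centers. For a local feature descriptor $x_i$, its weight for each cluster lies between 0 and 1; the highest weight indicates the cluster whose center is closest to the feature, and a low weight indicates that the feature is far from that cluster center. Note that as α tends to +∞, equation (2) reduces to the original VLAD structure.
Further, expanding the squares in equation (2) gives:
$$a_k(x_i)=\frac{e^{w_k^{T}x_i+b_k}}{\sum_{k'}e^{w_{k'}^{T}x_i+b_{k'}}} \tag{3}$$
where $w_k=2\alpha c_k$ and $b_k=-\alpha\lVert c_k\rVert^{2}$.
Substituting equation (3) into equation (1) gives the final VLAD feature vector:
$$V(j,k)=\sum_{i=1}^{N}\frac{e^{w_k^{T}x_i+b_k}}{\sum_{k'}e^{w_{k'}^{T}x_i+b_{k'}}}\,\bigl(x_i(j)-c_k(j)\bigr) \tag{4}$$
As the derivation above shows, the parameters $w_k$, $b_k$, and $c_k$ in equation (4) are all trainable. Through this soft assignment over the different clusters, NetVLAD can effectively aggregate first-order residual statistics of the different parts (clusters) of the local feature space. In addition, NetVLAD contains three parameters, $w_k$, $b_k$, and $c_k$, which makes it more flexible than the conventional VLAD method with its single parameter $c_k$, and all of the parameters can be learned end to end for a specific task.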
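The soft assignment and its linear re-parameterization can be compared numerically. The sketch below is an illustration only: it computes the distance-based weights (a softmax over negative scaled squared distances) and the equivalent form with $w_k=2\alpha c_k$ and $b_k=-\alpha\lVert c_k\rVert^2$, and the function names are hypothetical. In a trained NetVLAD layer the three parameter sets would be learned independently rather than tied to the cluster centers as they are here.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_assign_distance(x, centers, alpha):
    """Weights from negative scaled squared distances to each cluster center."""
    return softmax([-alpha * sum((xi - ci) ** 2 for xi, ci in zip(x, c))
                    for c in centers])

def soft_assign_linear(x, centers, alpha):
    """Same weights written as a softmax over w_k . x + b_k."""
    scores = []
    for c in centers:
        w = [2 * alpha * ci for ci in c]          # w_k = 2 * alpha * c_k
        b = -alpha * sum(ci * ci for ci in c)     # b_k = -alpha * ||c_k||^2
        scores.append(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return softmax(scores)

x = [1.0, 0.0]
centers = [[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]]
print(soft_assign_distance(x, centers, 0.5))
print(soft_assign_linear(x, centers, 0.5))    # matches the line above
print(soft_assign_distance(x, centers, 50.0)) # large alpha approaches hard assignment
```

The two forms agree because the term that depends only on $x_i$ cancels between numerator and denominator, which is why the weights can be produced by a linear layer followed by a softmax.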
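Continuing with FIG. 4, in step 470, the dimensionality of the feature vector is reduced to obtain the corresponding compressed representation vector.
As described in the preceding embodiment, because the process is not differentiable, conventional PCA cannot be plugged directly into a CNN model for training. Therefore, in one embodiment, step 470 may use NetPCA, first proposed in this application and described below, to perform the dimensionality reduction.
The embodiments of this application propose for the first time using a neural network to emulate the function of PCA, namely NetPCA. The core idea of NetPCA is to project the image into an orthogonal feature space so that the elements of the image representation are linearly uncorrelated, which greatly compresses the redundant information in the image representation. Unlike conventional PCA, in which the directions of the projection matrix are the principal component directions obtained by computation (see the description of step 350), NetPCA obtains the directions of the projection matrix through end-to-end training.
In one embodiment, NetPCA is set up as a fully connected layer of the overall CNN model; this layer receives the feature vector obtained in step 450 as input and has a preset number of neurons. The number of neurons equals the target compressed dimensionality L of the feature vector (the second dimensionality) and can therefore be set as required. In addition, the weights of each neuron are constrained to be a unit vector, and the weights of different neurons are mutually orthogonal, which ensures that the compressed image features lie in an orthonormal space.
With this network design, NetPCA can project the feature vector obtained in step 450 into an orthonormal feature space and obtain a compressed representation vector with the target dimensionality.
FIG. 5 shows the base network architecture of the CNN model corresponding to steps 430-470. As shown in FIG. 5, an image 501 first passes through VGGNet 502, which extracts the local features of the image; then through NetVLAD 503, which aggregates the local features into a K×D-dimensional feature vector representation; and then through the NetPCA fully connected layer 504 for dimensionality reduction, which finally outputs an L-dimensional compressed representation vector.
It should be noted that the image 501 here is a sample image during the model training process of this embodiment, and is the image to be queried during subsequent application of the model (that is, during place recognition).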
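Continuing with FIG. 4, in step 490, a CNN model satisfying a preset condition is obtained by training the model parameters.
With the network structure described in steps 430-470, the trainable parameters of the CNN model proposed in the embodiments of this application comprise the weight matrices of three parts: VGGNet, NetVLAD, and NetPCA.
To make the trained CNN model suitable for the place recognition task, the embodiments of this application construct a suitable loss function.
Optionally, the sample images include a first image, multiple second images captured at the same place as the first image, and multiple third images captured at places different from that of the first image; the feature vectors include a first feature vector corresponding to the first image, second feature vectors corresponding to the second images, and third feature vectors corresponding to the third images. When the CNN model is trained, its loss function is constructed from a first distance and a second distance, where the first distance is the distance between the first feature vector and a second feature vector, and the second distance is the distance between the first feature vector and a third feature vector.
In one embodiment, for a particular image (the first image) q among the sample images, the sample images whose geographic location is within a first threshold of that image (regarded as captured at the same place) may be set as potential positives $\{p_i^q\}$,
and the sample images whose geographic location is farther than a second threshold from that image (regarded as captured at a different place) may be set as negatives $\{n_j^q\}$.
This yields a training triplet $(q,\{p_i^q\},\{n_j^q\})$.
Based on the above training objective, in one embodiment, a triplet ranking loss function may be defined as follows:
$$L=\sum_{j} l\Bigl(\min_{i} d^{2}\bigl(q,p_i^q\bigr)+m-d^{2}\bigl(q,n_j^q\bigr)\Bigr) \tag{7}$$
where L is the loss function, m is the margin constant, d is the distance between the compressed representation vectors of two images, and l denotes the hinge loss; in other words, $l(x)=\max(x,0)$.
As equation (7) shows, in the above embodiment the loss function is designed as the sum of the individual losses over the negative images $n_j^q$.
For each negative image, if the distance between the particular image q and that negative exceeds the distance between q and the best-matching image $p_{i^*}^q$
by more than the preset margin, the loss is zero. Otherwise, when the difference does not satisfy the preset margin, the loss is proportional to the shortfall.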
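The behavior of the ranking loss described for equation (7) can be sketched as follows: each negative contributes a hinge term, which is zero once that negative is farther from the query than the best-matching positive by at least the margin. This is a minimal sketch under assumptions that are not taken from this application: the use of the squared Euclidean distance, the toy vectors, and the function names `sq_dist` and `triplet_ranking_loss` are all hypothetical.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two vectors (an assumed choice of d)."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_ranking_loss(q, positives, negatives, margin):
    """Sum over negatives of max(best_positive_distance + margin - negative_distance, 0)."""
    best = min(sq_dist(q, p) for p in positives)   # best-matching positive
    return sum(max(best + margin - sq_dist(q, n), 0.0) for n in negatives)

q = [0.0, 0.0]
positives = [[1.0, 0.0], [3.0, 0.0]]                 # squared distances 1 and 9
negatives = [[2.0, 0.0], [1.0, 1.0], [5.0, 0.0]]     # squared distances 4, 2 and 25

print(triplet_ranking_loss(q, positives, negatives, margin=0.5))  # every negative clears the margin
print(triplet_ranking_loss(q, positives, negatives, margin=2.0))  # the negative at distance 2 violates it
```

Taking the minimum over positives means that only the best-matching positive has to be close to the query, which tolerates potential positives that share a GPS location but show a different view.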
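Next, in one embodiment, to constrain the weights of the neurons in NetPCA to be mutually orthogonal, an orthogonality constraint term may be added to the above loss function; the orthogonality constraint term is obtained from the weight matrices of the neurons and a known unit vector.
Optionally, the orthogonality constraint term G is as follows:
$$G=\operatorname{sum}\bigl(g(W^{T}W-E)\bigr) \tag{8}$$
where W is the weight matrix of the neurons, T denotes matrix transposition, E is the known unit vector, which in equation (8) acts as the identity matrix of the same size as $W^{T}W$, g denotes squaring each element of a matrix, and sum denotes summation.
In this way, the optimal projection directions of NetPCA for the feature vector can be determined through end-to-end training; when training satisfies the preset convergence condition, the neuron weights W are the optimal projection directions so determined.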
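The orthogonality constraint term can be evaluated directly from its definition: form $W^{T}W$, subtract the identity, square every element, and sum. The sketch below assumes W is stored with one column per neuron (so W has the first dimensionality as its number of rows and L columns); the function name `orthogonality_penalty` and the toy matrices are hypothetical.

```python
def orthogonality_penalty(W):
    """G = sum of squared elements of (W^T W - I), with one column of W per neuron."""
    rows, cols = len(W), len(W[0])
    G = 0.0
    for a in range(cols):
        for b in range(cols):
            gram = sum(W[r][a] * W[r][b] for r in range(rows))  # entry (a, b) of W^T W
            target = 1.0 if a == b else 0.0                      # entry (a, b) of the identity
            G += (gram - target) ** 2
    return G

# Columns (1,0,0) and (0,1,0) are orthonormal, so the penalty vanishes.
W_ok = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
# Columns (1,0,0) and (1,1,0) are neither orthogonal nor both unit length.
W_bad = [[1.0, 1.0], [0.0, 1.0], [0.0, 0.0]]

print(orthogonality_penalty(W_ok))
print(orthogonality_penalty(W_bad))
```

The diagonal entries of the difference penalize neurons whose weights are not unit length, and the off-diagonal entries penalize pairs of neurons that are not orthogonal, so a single term covers both constraints.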
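In one embodiment, based on the above loss function with the added orthogonality constraint term G, step 490 may train the CNN using the standard stochastic gradient descent (SGD) algorithm. One example is shown in FIG. 6 and includes the following steps 610-650.
In step 610, the loss function is backpropagated through the CNN model to update the model parameters of the CNN model.
Taking a typical CNN structure with convolutional, pooling, and fully connected layers as an example, apart from the pooling layers, which may use random or fixed parameter matrices, the weight parameter matrices of the convolutional and fully connected layers can be updated during CNN training by backpropagation of the loss. For the network structure of the embodiments of this application, the weight parameter matrices in the first part for local feature extraction, the second part for feature aggregation, and the third part for dimensionality reduction can all be updated during CNN training by backpropagation of the loss.
In addition, for the CNN model at initialization (before any input data), the weight parameter matrices of the convolutional and fully connected layers may be initialized with different small random numbers. For example, the convolution kernels of all convolutional layers may be initialized from a Gaussian distribution with mean 0 and variance 0.01.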
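As a small illustration of the initialization mentioned above, the snippet below draws kernel weights from a Gaussian with mean 0 and variance 0.01. Note that Python's `random.gauss` takes a standard deviation, so the variance 0.01 corresponds to a standard deviation of 0.1. The function name `init_kernel`, the kernel shape, and the fixed seed are assumptions made for this sketch.

```python
import random

def init_kernel(out_channels, in_channels, size, variance=0.01, seed=0):
    """Return an out x in x size x size nested list of Gaussian-initialized weights."""
    rng = random.Random(seed)   # fixed seed so the sketch is reproducible
    std = variance ** 0.5       # variance 0.01 -> standard deviation 0.1
    return [[[[rng.gauss(0.0, std) for _ in range(size)]
              for _ in range(size)]
             for _ in range(in_channels)]
            for _ in range(out_channels)]

kernel = init_kernel(out_channels=64, in_channels=3, size=3)
flat = [w for o in kernel for i in o for row in i for w in row]
mean = sum(flat) / len(flat)
var = sum((w - mean) ** 2 for w in flat) / len(flat)
print(len(flat), mean, var)   # 1728 weights, mean near 0, variance near 0.01
```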
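In step 630, the loss function is recomputed based on the CNN model with updated parameters.
After the model parameters are updated in step 610, steps 430-470 above can be performed again to extract and aggregate the local features and reduce the dimensionality of the feature vector, and the loss is recomputed based on the constructed loss function.
In step 650, it is determined whether a preset stop condition is satisfied; if so, the model parameters of the CNN model are output, and otherwise the process returns to step 610.
Different preset stop conditions may be set for step 650 according to the training behavior of the model. For example, a count threshold may be set to limit the number of training iterations, a loss threshold may be set as the preset stop condition, or a convergence threshold on the model parameters may be set as the preset stop condition. The embodiments of this application impose no limitation in this regard.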
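The loop over steps 610-650 and the three kinds of stop condition can be sketched on a toy problem. The example below minimizes a one-parameter quadratic by plain gradient descent rather than training a CNN; the function name `train`, the learning rate, and the threshold values are hypothetical and serve only to show how an iteration limit, a loss threshold, and a parameter convergence threshold fit into the same loop.

```python
def train(loss_and_grad, w, lr=0.1, max_iters=1000, loss_tol=1e-6, param_tol=1e-9):
    """Gradient descent with three stop conditions; returns (w, steps, reason)."""
    for step in range(1, max_iters + 1):
        loss, grad = loss_and_grad(w)       # recompute the loss (cf. step 630)
        if loss < loss_tol:                 # stop condition: loss threshold
            return w, step, "loss threshold"
        new_w = w - lr * grad               # update the parameter (cf. step 610)
        if abs(new_w - w) < param_tol:      # stop condition: parameter convergence
            return new_w, step, "parameter convergence"
        w = new_w
    return w, max_iters, "iteration limit"  # stop condition: iteration count

def quadratic(w):
    """Toy loss (w - 3)^2 and its gradient."""
    return (w - 3.0) ** 2, 2.0 * (w - 3.0)

print(train(quadratic, w=0.0))               # stops on the loss threshold
print(train(quadratic, w=0.0, max_iters=5))  # stops on the iteration limit
```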
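With the model training method provided by the embodiments of this application, introducing a compression process with trainable parameters into the CNN model enables truly end-to-end training of the place recognition model, and the resulting CNN model directly produces low-dimensional image features, thereby improving place recognition performance.
In the above embodiments, this application proposes a differentiable feature compression layer, NetPCA, for compressing the features in a CNN model. With this NetPCA layer, the entire CNN model for place recognition can be trained truly end to end, and the CNN model finally obtained directly produces image features that are low-dimensional, highly discriminative, and generalize well. In addition, compared with treating feature compression as a post-processing step, integrating NetPCA into CNN model training significantly reduces the computational cost and greatly lowers the risk of overfitting.
FIGS. 7 and 8 illustrate a performance comparison between the place recognition model of this application and those of the related art. In FIG. 7, the curve labeled $f_{VLAD}$ shows the performance of NetVLAD with 32k dimensions; the curves labeled 512, 1024, 2048, and 4096 show the performance of the place recognition model of the embodiments of this application (NetVLAD+NetPCA) when the dimensionality of the compressed representation vector (the second dimensionality) is set to 512, 1024, 2048, and 4096, respectively. In FIG. 8, the curves labeled 512, 1024, 2048, and 4096 show the performance when the conventional PCA method is used to compress the 32k-dimensional image features output by NetVLAD to 512, 1024, 2048, and 4096 dimensions, respectively. Both figures are plotted on the Pitts250k test set; the horizontal axis is the number of best matches retrieved from the dataset, and the vertical axis is the recall (in %).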
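The recall plotted on the vertical axis can be computed as sketched below. This assumes the common retrieval convention that a query counts as correctly recognized if at least one of its top N retrieved database images is a true match; that convention, the function name `recall_at_n`, and the toy data are assumptions and are not stated in this application.

```python
def recall_at_n(ranked_results, ground_truth, n):
    """Percentage of queries with at least one true match among the top n results."""
    hits = sum(1 for query, ranked in ranked_results.items()
               if any(r in ground_truth[query] for r in ranked[:n]))
    return 100.0 * hits / len(ranked_results)

# Hypothetical ranked retrieval lists (most similar first) and true matches.
ranked = {
    "q1": ["a", "b", "c"],
    "q2": ["d", "e", "f"],
    "q3": ["g", "h", "i"],
    "q4": ["j", "k", "l"],
}
truth = {"q1": {"a"}, "q2": {"f"}, "q3": {"z"}, "q4": {"k"}}

for n in (1, 2, 3):
    print(n, recall_at_n(ranked, truth, n))
```

Recall computed this way can only stay the same or rise as N grows, which is why such curves are non-decreasing along the horizontal axis.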
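As FIG. 7 shows, the place recognition model of the embodiments of this application performs comparably to the 32k-dimensional image features of NetVLAD even when the output dimensionality of NetPCA is set to 512. The embodiments of this application can thus match the performance of NetVLAD at a significantly lower computational cost, obtaining highly discriminative image features through end-to-end training.
Comparing FIG. 7 with FIG. 8 shows that, for the same output dimensionality, the place recognition model of the embodiments of this application clearly outperforms conventional PCA dimensionality reduction applied after NetVLAD.
FIG. 9 is a flowchart of a place recognition method according to an exemplary embodiment. As shown in FIG. 9, the place recognition method may be executed by any computing device and may include the following steps 910-930.
In step 910, compressed representation vectors are extracted from a captured image sequence using the trained CNN model.
The CNN model used in step 910 may be trained by the model training method for place recognition described in any of the above embodiments.
In step 930, place recognition is performed based on the extracted compressed representation vectors.
Place recognition identifies the spatial location corresponding to a query image; the implementation of step 930 is illustrated in FIG. 10. In step 910, the CNN model performs local feature extraction, aggregation, and dimensionality reduction on the input image (shown in simplified form as an image feature extractor in FIG. 10) and finally obtains the compressed representation vector f of the image, so that the image database annotated with locations and the image to be queried are both projected into the image feature space. Then, the similarity between the compressed representation vector of the query image and that of each sample image in the database is computed; if the similarity between the query image and the most similar image in the database meets a certain threshold, the location of that database image is taken as the location of the query image.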
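The matching procedure of step 930 amounts to a nearest-neighbor search followed by a threshold test. The sketch below is illustrative only: cosine similarity is an assumed similarity measure (this application does not fix one), and the function names, the tiny two-dimensional vectors, and the threshold 0.9 are hypothetical.

```python
def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def recognize_place(query_vec, database, threshold):
    """Return the location of the most similar database image, or None if the
    best similarity does not reach the threshold."""
    best_location, best_sim = None, -1.0
    for vec, location in database:
        sim = cosine(query_vec, vec)
        if sim > best_sim:
            best_location, best_sim = location, sim
    return best_location if best_sim >= threshold else None

# Hypothetical compressed representation vectors with annotated locations.
database = [([1.0, 0.0], "bridge"), ([0.0, 1.0], "station")]
print(recognize_place([0.9, 0.1], database, threshold=0.9))  # close to "bridge"
print(recognize_place([0.7, 0.7], database, threshold=0.9))  # no confident match
```

Because the vectors being compared are the low-dimensional compressed representations, each similarity evaluation costs on the order of L multiplications rather than K×D.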
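With the place recognition method provided by the embodiments of this application, introducing a compression process with trainable parameters into the CNN model enables truly end-to-end training of the place recognition model, and the resulting CNN model directly produces low-dimensional image features, thereby improving place recognition performance.
The following are apparatus embodiments of this application, which can be used to carry out the embodiments of the model training method and the place recognition method described above. For details not disclosed in the apparatus embodiments, refer to the above embodiments of the model training method and the place recognition method.
FIG. 11 is a block diagram of a model training apparatus for place recognition according to an exemplary embodiment. As shown in FIG. 11, the model training apparatus includes but is not limited to: a feature extraction module 1110, a feature aggregation module 1120, a feature compression module 1130, and a model training module 1140.
The feature extraction module 1110 is configured to extract local features of sample images based on a first part of a convolutional neural network (CNN) model, the sample images including at least one group of multiple images captured at the same place;
the feature aggregation module 1120 is configured to aggregate the local features into a feature vector having a first dimensionality based on a second part of the CNN model;
the feature compression module 1130 is configured to obtain a compressed representation vector of the feature vector based on a third part of the CNN model, the compressed representation vector having a second dimensionality smaller than the first dimensionality; and
the model training module 1140 is configured to adjust the model parameters of the first part, the second part, and the third part, with the goal of minimizing the distance between the compressed representation vectors corresponding to the multiple images captured at the same place, until a CNN model satisfying a preset condition is obtained.
In one embodiment, the feature compression module 1130 is configured to project the feature vector into an orthonormal space based on the third part to obtain the compressed representation vector.
In one embodiment, the third part is a fully connected layer of the CNN model that receives the feature vector as input; the fully connected layer includes a number of neurons equal to the second dimensionality, the weight matrix of each neuron is a unit vector having the first dimensionality, and the weight matrices of the neurons are mutually orthogonal.
In one embodiment, the model training module 1140 is configured to construct the loss function of the CNN model based on an orthogonality constraint term on the weight matrices, the orthogonality constraint term being obtained from the weight matrices of the neurons and a known unit vector. Optionally, the expression of the orthogonality constraint term G is as shown in equation (8) above.
In one embodiment, the sample images include a first image, multiple second images captured at the same place as the first image, and multiple third images captured at places different from that of the first image, and the feature vectors include a first feature vector corresponding to the first image, second feature vectors corresponding to the second images, and third feature vectors corresponding to the third images. Accordingly, the model training module 1140 is configured to: construct the loss function of the CNN model based on a first distance and a second distance, the first distance being the distance between the first feature vector and a second feature vector, and the second distance being the distance between the first feature vector and a third feature vector; and backpropagate the loss function through the CNN model to update the model parameters until the CNN model satisfies a preset convergence condition.
In one embodiment, the loss function constructed by the model training module 1140 is as shown in equation (7) above.
In one embodiment, the feature extraction module 1110 is configured to extract the local features of the sample images using the VGGNet structure.
In one embodiment, the feature aggregation module 1120 is configured to aggregate the local features into the feature vector using the NetVLAD structure.
With the model training apparatus provided by the embodiments of this application, introducing a compression process with trainable parameters into the CNN model enables truly end-to-end training of the place recognition model, and the resulting CNN model directly produces low-dimensional image features, thereby improving place recognition performance.
FIG. 12 is a block diagram of a place recognition apparatus according to an exemplary embodiment. As shown in FIG. 12, the place recognition apparatus includes but is not limited to an extraction module 1210 and a recognition module 1220.
The extraction module 1210 is configured to extract a compressed representation vector from a captured image using the trained CNN model. The CNN model used in the extraction module 1210 may be trained by the model training apparatus for place recognition described in any of the above embodiments.
The recognition module 1220 is configured to perform place recognition based on the compressed representation vector extracted by the extraction module 1210.
In one embodiment, the extraction module 1210 uses the trained CNN model to perform local feature extraction, aggregation, and dimensionality reduction on the input image and finally obtains the compressed representation vector f of the image, so that the image database annotated with locations and the image to be queried are both projected into the image feature space. Then, the recognition module 1220 computes the similarity between the compressed representation vector of the query image and that of each sample image in the database; if the similarity between the query image and the most similar image in the database meets a certain threshold, the location of that database image is taken as the location of the query image.
With the place recognition apparatus provided by the embodiments of this application, introducing a compression process with trainable parameters into the CNN model enables truly end-to-end training of the place recognition model, and the resulting CNN model directly produces low-dimensional image features, thereby improving place recognition performance.
As for the apparatuses in the above embodiments, the specific manner in which each module performs its operations has been described in detail in the embodiments of the corresponding methods and is not elaborated here.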
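It should be noted that although several modules or units of the device for performing actions are mentioned in the detailed description above, this division is not mandatory. In fact, according to the embodiments of this disclosure, the features and functions of two or more modules or units described above may be embodied in one module or unit. Conversely, the features and functions of one module or unit described above may be further divided among multiple modules or units. A component shown as a module or unit may or may not be a physical unit; that is, it may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the objectives of the solutions of this disclosure.
From the description of the above embodiments, those skilled in the art will readily understand that the example embodiments described here may be implemented in software, or in software combined with the necessary hardware. The technical solutions according to the embodiments of this application may therefore be embodied in the form of a software product, which may be stored in a non-volatile storage medium (such as a CD-ROM, USB flash drive, or portable hard disk) or on a network, and which includes several instructions that cause a computing device (such as a personal computer, a server, a touch terminal, or a network device) to execute the methods according to the embodiments of this application.
Other embodiments of this application will readily occur to those skilled in the art upon consideration of the specification and practice of the invention disclosed here. This application is intended to cover any variations, uses, or adaptations of this application that follow its general principles and include common knowledge or customary technical means in the art not disclosed in this application. The specification and embodiments are to be regarded as exemplary only, with the true scope and spirit of this application being indicated by the following claims.
It should be understood that this application is not limited to the precise structures described above and shown in the drawings, and that various modifications and changes may be made without departing from its scope. The scope of this application is limited only by the appended claims.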
Claims (13)
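- A model training method for place recognition, wherein the method is executed by a computer device and includes: extracting local features of sample images based on a first part of a convolutional neural network (CNN) model, the sample images including at least one group of multiple images captured at the same place; aggregating the local features into a feature vector having a first dimensionality based on a second part of the CNN model; obtaining a compressed representation vector of the feature vector based on a third part of the CNN model, the compressed representation vector having a second dimensionality smaller than the first dimensionality; and adjusting the model parameters of the first part, the second part, and the third part, with the goal of minimizing the distance between the compressed representation vectors corresponding to the multiple images captured at the same place, until a CNN model satisfying a preset condition is obtained.
- The method according to claim 1, wherein obtaining the compressed representation vector of the feature vector based on the third part of the CNN model includes: projecting the feature vector into an orthonormal space based on the third part to obtain the compressed representation vector.
- The method according to claim 2, wherein the third part is a fully connected layer of the CNN model that receives the feature vector as input, the fully connected layer includes a number of neurons equal to the second dimensionality, the weight matrix of each neuron is a unit vector having the first dimensionality, and the weight matrices of the neurons are mutually orthogonal.
- The method according to claim 3, wherein adjusting the model parameters of the first part, the second part, and the third part includes: constructing the loss function of the CNN model based on an orthogonality constraint term on the weight matrices, the orthogonality constraint term being obtained from the weight matrices of the neurons and a known unit vector.
- The method according to claim 1 or 3, wherein the sample images include a first image, multiple second images captured at the same place as the first image, and multiple third images captured at places different from that of the first image, the feature vectors include a first feature vector corresponding to the first image, second feature vectors corresponding to the second images, and third feature vectors corresponding to the third images, and adjusting the model parameters of the first part, the second part, and the third part further includes: constructing the loss function of the CNN model based on a first distance and a second distance, the first distance being the distance between the first feature vector and a second feature vector, and the second distance being the distance between the first feature vector and a third feature vector; and backpropagating the loss function through the CNN model to update the model parameters until the CNN model satisfies a preset convergence condition.
- The method according to any one of claims 1-4, wherein extracting the local features of the sample images based on the first part of the convolutional neural network (CNN) model includes: extracting the local features of the sample images using a Visual Geometry Group network (VGGNet) structure.
- The method according to any one of claims 1-4, wherein aggregating the local features into the feature vector having the first dimensionality based on the second part of the CNN model includes: aggregating the local features into the feature vector using a vector of locally aggregated descriptors network (NetVLAD) structure.
- A place recognition method, including: extracting a compressed representation vector from a captured image using a convolutional neural network (CNN) model, the CNN model being trained by the method according to any one of claims 1 to 8; and performing place recognition based on the extracted compressed representation vector.
- A model training apparatus for place recognition, wherein the apparatus includes: a feature extraction module configured to extract local features of sample images based on a first part of a convolutional neural network (CNN) model, the sample images including at least one group of multiple images captured at the same place; a feature aggregation module configured to aggregate the local features into a feature vector having a first dimensionality based on a second part of the CNN model; a feature compression module configured to obtain a compressed representation vector of the feature vector based on a third part of the CNN model, the compressed representation vector having a second dimensionality smaller than the first dimensionality; and a model training module configured to adjust the model parameters of the first part, the second part, and the third part, with the goal of minimizing the distance between the compressed representation vectors corresponding to the multiple images captured at the same place, until a CNN model satisfying a preset condition is obtained.
- A place recognition apparatus, including: an extraction module configured to extract a compressed representation vector from a captured image using a convolutional neural network (CNN) model, the CNN model being trained by the method according to any one of claims 1 to 8; and a recognition module configured to perform place recognition based on the extracted compressed representation vector.
- A computer-readable storage medium storing a computer program that, when executed by a processor, implements the model training method for place recognition according to any one of claims 1 to 8 or the place recognition method according to claim 9.
- An electronic device, including: a processor; and a memory storing computer-readable instructions that, when executed by the processor, implement the model training method for place recognition according to any one of claims 1 to 8 or the place recognition method according to claim 9.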
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| EP20805484.1A EP3968179B1 (en) | 2019-05-10 | 2020-04-27 | Place recognition method and apparatus, model training method and apparatus for place recognition, and electronic device |
| US17/374,810 US12100192B2 (en) | 2019-05-10 | 2021-07-13 | Method, apparatus, and electronic device for training place recognition model |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201910390693.2A CN110209859B (zh) | 2019-05-10 | 2019-05-10 | 地点识别及其模型训练的方法和装置以及电子设备 |
| CN201910390693.2 | 2019-05-10 |
Related Child Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| US17/374,810 Continuation US12100192B2 (en) | 2019-05-10 | 2021-07-13 | Method, apparatus, and electronic device for training place recognition model |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2020228525A1 true WO2020228525A1 (zh) | 2020-11-19 |
Family
ID=67785854
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2020/087308 Ceased WO2020228525A1 (zh) | 2019-05-10 | 2020-04-27 | 地点识别及其模型训练的方法和装置以及电子设备 |
Country Status (4)
| Country | Link |
|---|---|
| US (1) | US12100192B2 (zh) |
| EP (1) | EP3968179B1 (zh) |
| CN (1) | CN110209859B (zh) |
| WO (1) | WO2020228525A1 (zh) |
Cited By (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN113034407A (zh) * | 2021-04-27 | 2021-06-25 | 深圳市慧鲤科技有限公司 | 图像处理方法及装置、电子设备和存储介质 |
| WO2022115779A1 (en) * | 2020-11-30 | 2022-06-02 | Mercari, Inc. | Automatic ontology generation by embedding representations |
| CN115294405A (zh) * | 2022-09-29 | 2022-11-04 | 浙江天演维真网络科技股份有限公司 | 农作物病害分类模型的构建方法、装置、设备及介质 |
| CN115393750A (zh) * | 2021-05-25 | 2022-11-25 | 阿里巴巴新加坡控股有限公司 | 车辆重识别方法及模型训练方法 |
Families Citing this family (26)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN110209859B (zh) | 2019-05-10 | 2022-12-27 | 腾讯科技(深圳)有限公司 | 地点识别及其模型训练的方法和装置以及电子设备 |
| CN112749705B (zh) * | 2019-10-31 | 2024-06-11 | 深圳云天励飞技术有限公司 | 训练模型更新方法及相关设备 |
| CN110954933B (zh) * | 2019-12-09 | 2023-05-23 | 王相龙 | 一种基于场景dna的移动平台定位装置及方法 |
| CN111274816B (zh) * | 2020-01-15 | 2021-05-18 | 湖北亿咖通科技有限公司 | 一种基于神经网络的命名实体识别方法和车机 |
| EP3907696A1 (en) * | 2020-05-06 | 2021-11-10 | Koninklijke Philips N.V. | Method and system for identifying abnormal images in a set of medical images |
| CN115516519A (zh) * | 2020-05-06 | 2022-12-23 | 伊莱利利公司 | 用于视觉感知的基于层次的对象识别的方法和装置 |
| CN112036461B (zh) * | 2020-08-24 | 2023-06-02 | 湖北师范大学 | 手写数字图像识别方法、装置、设备及计算机存储介质 |
| US11914670B2 (en) * | 2020-09-08 | 2024-02-27 | Huawei Technologies Co., Ltd. | Methods and systems for product quantization-based compression of a matrix |
| CN112115997B (zh) * | 2020-09-11 | 2022-12-02 | 苏州浪潮智能科技有限公司 | 一种物体识别模型的训练方法、系统及装置 |
| CN112402978B (zh) * | 2020-11-13 | 2024-07-16 | 上海幻电信息科技有限公司 | 地图生成方法及装置 |
| CN112907088B (zh) * | 2021-03-03 | 2024-03-08 | 杭州诚智天扬科技有限公司 | 一种清分模型的参数调整方法及系统 |
| CN113282777B (zh) * | 2021-04-20 | 2025-02-25 | 北京沃东天骏信息技术有限公司 | 一种模型训练方法、装置、电子设备及存储介质 |
| CN113128601B (zh) * | 2021-04-22 | 2022-04-29 | 北京百度网讯科技有限公司 | 分类模型的训练方法和对图像进行分类的方法 |
| CN117441196A (zh) * | 2021-06-07 | 2024-01-23 | 大陆汽车科技有限公司 | 用于确定图像描述符的方法、编码流水线、以及视觉地点识别方法 |
| US12387147B2 (en) * | 2021-10-15 | 2025-08-12 | Gracenote, Inc. | Unified representation learning of media features for diverse tasks |
| CN114241227A (zh) * | 2021-12-08 | 2022-03-25 | 北京科技大学 | 一种基于vlad的图像识别方法及装置 |
| CN114639121B (zh) * | 2022-03-21 | 2025-05-30 | 银河水滴科技(江苏)有限公司 | 基于特征方向压缩的跨着装行人步态识别方法及系统 |
| US20240169194A1 (en) * | 2022-11-18 | 2024-05-23 | Salesforce, Inc. | Training neural networks for name generation |
| CN115905263B (zh) * | 2022-12-16 | 2026-02-06 | 北京百度网讯科技有限公司 | 向量数据库的更新方法及基于向量数据库的人脸识别方法 |
| CN116205309B (zh) * | 2023-01-19 | 2026-01-06 | 北京百度网讯科技有限公司 | 内容理解模型的训练方法、内容理解方法以及相关装置 |
| CN116129936A (zh) * | 2023-02-02 | 2023-05-16 | 百果园技术(新加坡)有限公司 | 一种动作识别方法、装置、设备、存储介质及产品 |
| CN116932802B (zh) * | 2023-07-10 | 2024-05-14 | 玩出梦想(上海)科技有限公司 | 一种图像检索方法 |
| CN117809066B (zh) * | 2024-03-01 | 2024-06-07 | 山东浪潮科学研究院有限公司 | 一种卷烟交付目的地一致性检验系统、方法、设备及介质 |
| DE102024202027A1 (de) * | 2024-03-05 | 2025-09-11 | Robert Bosch Gesellschaft mit beschränkter Haftung | Computerimplementiertes Verfahren und System zum Optimieren eines Clusterns einer Vielzahl von Eingabedaten |
| CN119188782B (zh) * | 2024-11-22 | 2025-05-27 | 江苏宁昆机器人智能科技有限公司 | 用于工业机器人的智能夹具路径规划方法 |
| CN120104820B (zh) * | 2025-02-20 | 2025-09-19 | 中国传媒大学 | 基于vq-vae的频高图相似性检索系统与方法 |
Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN103609178A (zh) * | 2011-06-17 | 2014-02-26 | 微软公司 | 地点辅助的识别 |
| CN105550701A (zh) * | 2015-12-09 | 2016-05-04 | 福州华鹰重工机械有限公司 | 实时图像提取识别方法及装置 |
| CN110209859A (zh) * | 2019-05-10 | 2019-09-06 | 腾讯科技(深圳)有限公司 | 地点识别及其模型训练的方法和装置以及电子设备 |
Family Cites Families (17)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN106055576B (zh) * | 2016-05-20 | 2018-04-10 | 大连理工大学 | 一种大规模数据背景下的快速有效的图像检索方法 |
| JP2018089142A (ja) * | 2016-12-05 | 2018-06-14 | 学校法人同志社 | 脳機能イメージングデータからヒトの脳活動状態を推定する方法 |
| CN107067011B (zh) * | 2017-03-20 | 2019-05-03 | 北京邮电大学 | 一种基于深度学习的车辆颜色识别方法与装置 |
| CN107273925B (zh) * | 2017-06-12 | 2020-10-09 | 太原理工大学 | 基于局部感受野和半监督深度自编码的肺实质ct影像处理装置 |
| CN108288067B (zh) * | 2017-09-12 | 2020-07-24 | 腾讯科技(深圳)有限公司 | 图像文本匹配模型的训练方法、双向搜索方法及相关装置 |
| US11093832B2 (en) * | 2017-10-19 | 2021-08-17 | International Business Machines Corporation | Pruning redundant neurons and kernels of deep convolutional neural networks |
| US11586875B2 (en) * | 2017-11-22 | 2023-02-21 | Massachusetts Institute Of Technology | Systems and methods for optimization of a data model network architecture for target deployment |
| US10861152B2 (en) * | 2018-03-16 | 2020-12-08 | Case Western Reserve University | Vascular network organization via Hough transform (VaNgOGH): a radiomic biomarker for diagnosis and treatment response |
| US11561541B2 (en) * | 2018-04-09 | 2023-01-24 | SafeAI, Inc. | Dynamically controlling sensor behavior |
| US10685236B2 (en) * | 2018-07-05 | 2020-06-16 | Adobe Inc. | Multi-model techniques to generate video metadata |
| CN108596163A (zh) * | 2018-07-10 | 2018-09-28 | 中国矿业大学(北京) | 一种基于cnn和vlad的煤岩识别方法 |
| US10887640B2 (en) * | 2018-07-11 | 2021-01-05 | Adobe Inc. | Utilizing artificial intelligence to generate enhanced digital content and improve digital content campaign design |
| CN109255381B (zh) * | 2018-09-06 | 2022-03-29 | 华南理工大学 | 一种基于二阶vlad稀疏自适应深度网络的图像分类方法 |
| US10869036B2 (en) * | 2018-09-18 | 2020-12-15 | Google Llc | Receptive-field-conforming convolutional models for video coding |
| CN109446991A (zh) * | 2018-10-30 | 2019-03-08 | 北京交通大学 | 基于全局和局部特征融合的步态识别方法 |
| CN109684977A (zh) * | 2018-12-18 | 2019-04-26 | 成都三零凯天通信实业有限公司 | 一种基于端到端深度学习的视图地标检索方法 |
| US11756291B2 (en) * | 2018-12-18 | 2023-09-12 | Slyce Acquisition Inc. | Scene and user-input context aided visual search |
-
2019
- 2019-05-10 CN CN201910390693.2A patent/CN110209859B/zh active Active
-
2020
- 2020-04-27 EP EP20805484.1A patent/EP3968179B1/en active Active
- 2020-04-27 WO PCT/CN2020/087308 patent/WO2020228525A1/zh not_active Ceased
-
2021
- 2021-07-13 US US17/374,810 patent/US12100192B2/en active Active
Non-Patent Citations (2)
| Title |
|---|
| ARANDJELOVIC RELJA; GRONAT PETR; TORII AKIHIKO; PAJDLA TOMAS; SIVIC JOSEF: "NetVLAD: CNN architecture for weakly supervised place recognition,", 2016 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 27 June 2016 (2016-06-27), pages 5297 - 5307, XP033021725, DOI: 10.1109/CVPR.2016.572 * |
| See also references of EP3968179A4 * |
Cited By (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2022115779A1 (en) * | 2020-11-30 | 2022-06-02 | Mercari, Inc. | Automatic ontology generation by embedding representations |
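| CN113034407A (zh) * | 2021-04-27 | 2021-06-25 | 深圳市慧鲤科技有限公司 | Image processing method and apparatus, electronic device, and storage medium |
| CN113034407B (zh) * | 2021-04-27 | 2022-07-05 | 深圳市慧鲤科技有限公司 | Image processing method and apparatus, electronic device, and storage medium |
| CN115393750A (zh) * | 2021-05-25 | 2022-11-25 | 阿里巴巴新加坡控股有限公司 | Vehicle re-identification method and model training method |
| CN115294405A (zh) * | 2022-09-29 | 2022-11-04 | 浙江天演维真网络科技股份有限公司 | Method, apparatus, device, and medium for constructing a crop disease classification model |
| CN115294405B (zh) * | 2022-09-29 | 2023-01-10 | 浙江天演维真网络科技股份有限公司 | Method, apparatus, device, and medium for constructing a crop disease classification model |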
Also Published As
| Publication number | Publication date |
|---|---|
| EP3968179A1 (en) | 2022-03-16 |
| US12100192B2 (en) | 2024-09-24 |
| CN110209859A (zh) | 2019-09-06 |
| EP3968179B1 (en) | 2026-04-01 |
| CN110209859B (zh) | 2022-12-27 |
| US20210342643A1 (en) | 2021-11-04 |
| EP3968179A4 (en) | 2022-06-29 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
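| US12100192B2 (en) | | Method, apparatus, and electronic device for training place recognition model |
| EP3084682B1 (en) | | System and method for identifying faces in unconstrained media |
| CN110503076B (zh) | | Artificial-intelligence-based video classification method, apparatus, device, and medium |
| CN110309856A (zh) | | Image classification method, and neural network training method and apparatus |
| WO2020228446A1 (zh) | | Model training method, apparatus, terminal, and storage medium |
| CN111709311A (zh) | | Pedestrian re-identification method based on multi-scale convolutional feature fusion |
| CN113705596A (zh) | | Image recognition method, apparatus, computer device, and storage medium |
| CN110222718B (zh) | | Image processing method and apparatus |
| WO2020098257A1 (zh) | | Image classification method, apparatus, and computer-readable storage medium |
| WO2022001364A1 (zh) | | Method for extracting data features and related apparatus |
| CN117854156B (zh) | | Training method for a feature extraction model and related apparatus |
| CN110472622A (zh) | | Video processing method and related apparatus, and image processing method and related apparatus |
| CN115049880B (zh) | | Method, apparatus, device, and medium for generating an image classification network and for image classification |
| EP4712050A1 (en) | | Feature extraction model processing method and apparatus, and feature extraction method and apparatus |
| CN117635962A (zh) | | Channel-attention image processing method based on multi-frequency fusion |
| Gao et al. | | Dimensionality reduction of SPD data based on Riemannian manifold tangent spaces and local affinity |
| WO2024066927A1 (zh) | | Training method, apparatus, and device for an image classification model |
| CN111382791A (zh) | | Deep learning task processing method, and image recognition task processing method and apparatus |
| CN111275183A (zh) | | Visual task processing method, apparatus, and electronic system |
| CN115862055A (zh) | | Pedestrian re-identification method and apparatus based on contrastive learning and adversarial training |
| CN117058739B (zh) | | Face clustering update method and apparatus |
| CN116994043A (zh) | | Few-shot image recognition optimization method, apparatus, device, and storage medium |
| CN112487927B (zh) | | Method and system for indoor scene recognition based on object-association attention |
| CN116758365A (zh) | | Video processing method, machine learning model training method, and related apparatus and device |
| CN117274596B (zh) | | Open-set semantic segmentation method and system based on neural radiance fields |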
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 20805484; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | ENP | Entry into the national phase | Ref document number: 2020805484; Country of ref document: EP; Effective date: 20211210 |
| | WWG | WIPO information: grant in national office | Ref document number: 2020805484; Country of ref document: EP |



