WO2024083121A1 - Data processing method and device thereof - Google Patents
- Publication number
- WO2024083121A1 (PCT/CN2023/124977)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- image
- text
- feature
- target object
- processing
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/70—Labelling scene content, e.g. deriving syntactic or semantic representations
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/126—Character encoding
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/14—Image acquisition
- G06V30/148—Segmentation of character regions
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/19—Recognition using electronic means
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/40—Document-oriented image-based pattern recognition
- G06V30/41—Analysis of document content
Definitions
- The present application relates to the field of artificial intelligence, and in particular to a data processing method and a device thereof.
- Artificial intelligence (AI) is the theory, method, technology and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results.
- In other words, artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can respond in a way similar to human intelligence.
- Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
- Language-driven precise instance segmentation is a special semantic segmentation technology. It refers to accurately segmenting the instance targets described by language in an image according to the guidance of natural language. Its characteristics are: 1) Traditional semantic segmentation models predict the same label for all targets belonging to the same category and do not distinguish different targets in the same category, while language-driven precise instance segmentation needs to accurately identify an instance target corresponding to the language description from multiple similar targets; 2) The semantic segmentation model needs to pre-define a set of semantic category labels in order to learn to segment targets of these categories, while language-driven precise instance segmentation can accept more flexible natural language input and is not limited to target categories.
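The contrast between the two tasks can be illustrated with a toy sketch. The pixel row, the `instance_ids` layout, and the mapping from a phrase to an instance are all invented for illustration and do not come from the application.

```python
# Toy 1x6 "image" row containing two cars (instances 1 and 2) and background (0).
# All values here are illustrative assumptions.
instance_ids = [0, 1, 1, 0, 2, 2]
category_of = {0: "background", 1: "car", 2: "car"}


def semantic_mask(category):
    """Semantic segmentation: every pixel of the category gets the same label."""
    return [1 if category_of[i] == category else 0 for i in instance_ids]


def instance_mask(target_instance):
    """Language-driven instance segmentation: only the described instance is kept."""
    return [1 if i == target_instance else 0 for i in instance_ids]


# "car" as a category selects both cars.
car_pixels = semantic_mask("car")
# A description such as "the car on the right" is assumed to resolve to instance 2.
right_car_pixels = instance_mask(2)
```

In this sketch the category mask cannot tell the two cars apart, while the instance mask isolates the single object that the description refers to.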
- the present application provides a data processing method that can effectively solve the problems of inaccurate target positioning and mask prediction or detection box prediction in existing language-driven precise instance segmentation methods, thereby improving the processing accuracy of the model.
- The present application provides a data processing method, comprising: obtaining a first image feature corresponding to an image and a text feature corresponding to a text, where the semantics of the text corresponds to a target object and the text indicates that the region corresponding to the target object is to be predicted from the image; obtaining, through a neural network, a plurality of second embedding vectors based on a plurality of preset first embedding vectors and the first image feature, each second embedding vector corresponding to an object in the image; fusing each second embedding vector with the first image feature to obtain a corresponding second image feature; determining a weight corresponding to each second embedding vector based on the similarity between the text feature and the plurality of second embedding vectors; and fusing the plurality of weights with the plurality of second image features (for example, by weighting) to determine a predicted region corresponding to the target object.
- the image may include multiple objects including the target object, each second embedding vector corresponds to an object in the image, and one or more embedding vectors in the multiple second embedding vectors may correspond to the target object.
- "Corresponding" here means that the second embedding vector describes the characteristics of an object in the image. The second embedding vectors obtained by the neural network can distinguish different objects in the image, so that the subsequent prediction process can operate at object granularity.
- This is equivalent to changing the image features from pixel granularity to target object granularity, that is, introducing target integrity constraints in cross-modal feature fusion, fusing pixels belonging to the same target as a whole with language encoding, and activating instance areas based on targets.
- This can effectively solve the problems of inaccurate target positioning and mask prediction or detection box prediction in existing language-driven precise instance segmentation methods, thereby improving the processing accuracy of the model.
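The move from pixel granularity to object granularity can be sketched as follows. The application does not specify the fusion operator in this passage, so the per-pixel dot product below is an assumption, and the shapes and random values are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

H, W, C = 4, 4, 8      # spatial size and channel count (illustrative)
N = 3                  # number of second embedding vectors, one per candidate object

first_image_feature = rng.standard_normal((H, W, C))
second_embeddings = rng.standard_normal((N, C))

# Assumed fusion: per-pixel dot product between each embedding and the image feature.
# Result: one H x W response map (a "second image feature") per embedding.
second_image_features = np.einsum("nc,hwc->nhw", second_embeddings, first_image_feature)
```

Each of the `N` maps responds to one object as a whole, which is what allows later steps to select or weight whole objects rather than individual pixels.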
- the predicted area is a mask area or a detection box.
- the semantics of the text corresponds to the target object, specifically including: the semantics of the text is used to describe the characteristics of the target object.
- acquiring a first image feature corresponding to the image and a text feature corresponding to the text includes:
- the third image feature and the first text feature are fused through a bidirectional attention mechanism to obtain the first image feature corresponding to the image and the text feature corresponding to the text.
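A bidirectional attention fusion of this kind might look like the minimal sketch below. It uses single-head scaled dot-product attention without learned projections and a residual addition; these simplifications, the function names, and the token shapes are assumptions rather than details from the application.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attend(queries, context):
    """Single-head scaled dot-product attention without learned projections."""
    scores = queries @ context.T / np.sqrt(queries.shape[-1])
    return softmax(scores, axis=-1) @ context


def bidirectional_fusion(image_tokens, text_tokens):
    """Update each modality with information attended from the other (residual add)."""
    fused_image = image_tokens + cross_attend(image_tokens, text_tokens)
    fused_text = text_tokens + cross_attend(text_tokens, image_tokens)
    return fused_image, fused_text


rng = np.random.default_rng(0)
third_image_feature = rng.standard_normal((16, 8))   # 16 flattened pixels, 8 channels
first_text_feature = rng.standard_normal((5, 8))     # 5 word tokens, 8 channels

first_image_feature, text_feature = bidirectional_fusion(third_image_feature, first_text_feature)
```

Both outputs keep the shape of their inputs, so the fused image feature can still be reshaped back to a spatial map.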
- the first image feature is a feature that is upsampled to a size consistent with the image.
- the neural network includes multiple transformer layers.
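As a rough sketch of how stacked transformer layers could turn the preset first embedding vectors into second embedding vectors, the code below lets the embeddings cross-attend to the flattened image feature in each layer. Self-attention and feed-forward sublayers are omitted, and the random projection matrices stand in for learned weights; all of this is an illustrative assumption.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def decoder_layer(queries, image_tokens, w_q, w_k, w_v):
    """One simplified layer: queries cross-attend to image tokens, then residual + norm."""
    q, k, v = queries @ w_q, image_tokens @ w_k, image_tokens @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    return layer_norm(queries + attn @ v)


rng = np.random.default_rng(0)
C, N, P, num_layers = 8, 3, 16, 2
first_embeddings = rng.standard_normal((N, C))     # preset first embedding vectors
image_tokens = rng.standard_normal((P, C))         # flattened first image feature

embeddings = first_embeddings
for _ in range(num_layers):
    # Random matrices stand in for the learned projections of each layer.
    w_q, w_k, w_v = (rng.standard_normal((C, C)) / np.sqrt(C) for _ in range(3))
    embeddings = decoder_layer(embeddings, image_tokens, w_q, w_k, w_v)
second_embeddings = embeddings
```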
- a data processing method includes:
- each second embedding vector corresponds to an object in the image
- each second embedding vector and the first image feature are used to fuse to obtain a corresponding second image feature
- the feature extraction network and the neural network are updated according to the difference between the predicted area and the real area corresponding to the target object in the image.
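The text only states that the networks are updated according to the difference between the predicted area and the real area. One common way to quantify such a difference for masks is a sum of binary cross-entropy and Dice loss; this choice is an assumption, not something the application specifies here.

```python
import numpy as np


def bce_loss(pred_prob, target, eps=1e-7):
    """Mean binary cross-entropy between predicted probabilities and a 0/1 mask."""
    p = np.clip(pred_prob, eps, 1.0 - eps)
    return float(-(target * np.log(p) + (1 - target) * np.log(1 - p)).mean())


def dice_loss(pred_prob, target, eps=1e-7):
    """1 - Dice coefficient; 0 when prediction and target overlap perfectly."""
    inter = (pred_prob * target).sum()
    return float(1.0 - (2.0 * inter + eps) / (pred_prob.sum() + target.sum() + eps))


real_region = np.array([[0, 1, 1], [0, 1, 1], [0, 0, 0]], dtype=float)
good_pred = np.where(real_region == 1, 0.9, 0.1)   # close to the real region
bad_pred = 1.0 - good_pred                          # inverted prediction

loss_good = bce_loss(good_pred, real_region) + dice_loss(good_pred, real_region)
loss_bad = bce_loss(bad_pred, real_region) + dice_loss(bad_pred, real_region)
```

In a real training loop this scalar would be backpropagated through both the feature extraction network and the neural network that produces the second embedding vectors.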
- the image may include multiple objects including the target object, each second embedding vector corresponds to an object in the image, and one or more embedding vectors in the multiple second embedding vectors may correspond to the target object.
- "Corresponding" here means that the second embedding vector describes the characteristics of an object in the image. The second embedding vectors obtained by the neural network can distinguish different objects in the image, so that the subsequent prediction process can operate at object granularity.
- This is equivalent to changing the image features from pixel granularity to target object granularity, that is, introducing target integrity constraints in cross-modal feature fusion, fusing pixels belonging to the same target as a whole with language encoding, and activating instance areas based on targets.
- This can effectively solve the problems of inaccurate target positioning and mask prediction or detection box prediction in existing language-driven precise instance segmentation methods, thereby improving the processing accuracy of the model.
- the predicted area is a mask area or a detection box.
- the semantics of the text corresponds to the target object, specifically including: the semantics of the text is used to describe the characteristics of the target object.
- acquiring a first image feature corresponding to the image and a text feature corresponding to the text includes:
- the third image feature and the first text feature are fused through a bidirectional attention mechanism to obtain the first image feature corresponding to the image and the text feature corresponding to the text.
- the present application provides a data processing device, including:
- a processing module configured to obtain a first image feature corresponding to the image and a text feature corresponding to the text; the semantics of the text corresponds to the target object, and the text indicates a region corresponding to the target object predicted from the image;
- multiple second embedding vectors are obtained through a neural network, each second embedding vector corresponding to an object in the image; each second embedding vector is fused with the first image feature to obtain a corresponding second image feature;
- a weight corresponding to each of the second embedding vectors is determined, and the multiple weights are used to be fused with the multiple second image features to determine the predicted area corresponding to the target object.
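The weighting step can be sketched as below. Cosine similarity, a softmax over the similarities, a sigmoid, and a 0.5 threshold are all assumed choices, and `predict_region` is a hypothetical helper name; the text itself only requires that the weights come from the similarity between the text feature and the second embedding vectors.

```python
import numpy as np


def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()


def predict_region(text_feature, second_embeddings, second_image_features, threshold=0.5):
    # Cosine similarity between the text feature and each second embedding vector.
    t = text_feature / np.linalg.norm(text_feature)
    e = second_embeddings / np.linalg.norm(second_embeddings, axis=1, keepdims=True)
    weights = softmax(e @ t)                       # one weight per embedding
    # Weighted sum of the per-object response maps, then a sigmoid and a threshold.
    fused = np.tensordot(weights, second_image_features, axes=1)
    prob = 1.0 / (1.0 + np.exp(-fused))
    return weights, prob > threshold


# Two candidate objects: the first occupies the left column, the second the right.
second_embeddings = np.array([[1.0, 0.0], [0.0, 1.0]])
text_feature = np.array([5.0, 0.0])                # aligned with the first object
second_image_features = np.array([
    [[4.0, -4.0], [4.0, -4.0]],
    [[-4.0, 4.0], [-4.0, 4.0]],
])

weights, mask = predict_region(text_feature, second_embeddings, second_image_features)
```

Because the text feature is closest to the first embedding, the first object's map dominates the fused result and the predicted mask covers only the left column.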
- the image may include multiple objects including the target object, each second embedding vector corresponds to an object in the image, and one or more embedding vectors in the multiple second embedding vectors may correspond to the target object.
- "Corresponding" here means that the second embedding vector describes the characteristics of an object in the image. The second embedding vectors obtained by the neural network can distinguish different objects in the image, so that the subsequent prediction process can operate at object granularity.
- This is equivalent to changing the image features from pixel granularity to target object granularity, that is, introducing target integrity constraints in cross-modal feature fusion, fusing pixels belonging to the same target as a whole with language encoding, and activating instance areas based on targets.
- This can effectively solve the problems of inaccurate target positioning and mask prediction or detection box prediction in existing language-driven precise instance segmentation methods, thereby improving the processing accuracy of the model.
- the predicted area is a mask area or a detection box.
- the semantics of the text corresponds to the target object, specifically including: the semantics of the text is used to describe the characteristics of the target object.
- the processing module is specifically configured to:
- the third image feature and the first text feature are fused through a bidirectional attention mechanism to obtain the first image feature corresponding to the image and the text feature corresponding to the text.
- the first image feature is a feature that is upsampled to a size consistent with the image.
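Upsampling a low-resolution feature map to the image size can be sketched with nearest-neighbour indexing. The interpolation mode is not stated in this passage, so nearest-neighbour is an assumption (bilinear interpolation would be an equally plausible choice), and `upsample_nearest` is a hypothetical helper.

```python
import numpy as np


def upsample_nearest(feature, out_h, out_w):
    """Nearest-neighbour upsampling of an (h, w, c) feature map to (out_h, out_w, c)."""
    h, w, _ = feature.shape
    rows = np.arange(out_h) * h // out_h    # source row index for each output row
    cols = np.arange(out_w) * w // out_w    # source column index for each output column
    return feature[rows][:, cols]


low_res = np.arange(4, dtype=float).reshape(2, 2, 1)
full_res = upsample_nearest(low_res, 4, 4)
```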
- the neural network includes multiple transformer layers.
- the present application provides a data processing device, comprising:
- a processing module used for obtaining a first image feature corresponding to an image and a text feature corresponding to a text; the semantics of the text corresponds to a target object, and the text indicates a region corresponding to the target object predicted from the image; the first image feature and the text feature are obtained according to a feature extraction network;
- each second embedding vector corresponds to an object in the image
- each second embedding vector and the first image feature are used to fuse to obtain a corresponding second image feature
- An updating module is used to update the feature extraction network and the neural network according to the difference between the predicted area and the real area corresponding to the target object in the image.
- the predicted area is a mask area or a detection box.
- the semantics of the text corresponds to the target object, specifically including: the semantics of the text is used to describe the characteristics of the target object.
- the processing module is specifically configured to:
- the third image feature and the first text feature are fused through a bidirectional attention mechanism to obtain the first image feature corresponding to the image and the text feature corresponding to the text.
- an embodiment of the present application provides a data processing device, which may include a memory, a processor, and a bus system, wherein the memory is used to store programs, and the processor is used to execute the programs in the memory to execute the above-mentioned first aspect and any optional method thereof, and the above-mentioned second aspect and any optional method thereof.
- an embodiment of the present application provides a computer-readable storage medium, in which a computer program is stored.
- When the computer program is run on a computer, the computer executes the above-mentioned first aspect and any optional method thereof, and the above-mentioned second aspect and any optional method thereof.
- an embodiment of the present application provides a computer program which, when executed on a computer, enables the computer to execute the above-mentioned first aspect and any optional method thereof, and the above-mentioned second aspect and any optional method thereof.
- the present application provides a chip system, which includes a processor for supporting a data processing device in implementing the functions involved in the above aspects, such as sending or processing the data or information involved in the above methods.
- the chip system also includes a memory, which is used to store program instructions and data necessary for the execution device or training device.
- the chip system can be composed of chips, or it can include chips and other discrete devices.
- FIG. 1A is a schematic diagram of a structure of an artificial intelligence main framework.
- FIG. 1B to FIG. 1C are schematic diagrams of the application system framework of the present application.
- FIG. 1D is a schematic diagram of an optional hardware structure of a terminal.
- FIG. 2 is a schematic diagram of the structure of a server.
- FIG. 3 is a schematic diagram of a system architecture of the present application.
- FIG. 4 shows a process of a cloud service.
- FIG. 5 is a schematic diagram of a network structure.
- FIG. 6 is a flowchart of a data processing method provided in an embodiment of the present application.
- FIG. 7 to FIG. 10 are schematic diagrams of a process of a data processing method provided in an embodiment of the present application.
- FIG. 11A and FIG. 11B are schematic diagrams of a beneficial effect of the present application.
- FIG. 12 is a schematic diagram of a structure of a data processing device provided in an embodiment of the present application.
- FIG. 13 is a schematic diagram of a structure of an execution device provided in an embodiment of the present application.
- FIG. 14 is a schematic diagram of a structure of a training device provided in an embodiment of the present application.
- FIG. 15 is a schematic diagram of the structure of a chip provided in an embodiment of the present application.
- the terms “substantially,” “about,” and similar terms are used as terms of approximation rather than as terms of degree, and are intended to take into account the inherent variations in measurements or calculations that will be known to those of ordinary skill in the art.
- the use of “may” when referring to embodiments means “one or more possible embodiments.”
- the terms “use,” “using,” and “used” may be considered synonymous with the terms “utilize,” “utilizing,” and “utilized,” respectively.
- the term “exemplary” is intended to refer to an example or illustration.
- Figure 1A shows a structural diagram of the main framework of artificial intelligence.
- the following explains the above artificial intelligence main framework along two dimensions: the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis).
- the "intelligent information chain" reflects a series of processes from data acquisition to processing. For example, it can be a general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, and intelligent execution and output. In this process, the data undergoes a condensation process of "data-information-knowledge-wisdom".
- the "IT value chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure of artificial intelligence and information (providing and processing technology implementation) to the industrial ecology of the system.
- the infrastructure provides computing power support for the artificial intelligence system, enables communication with the outside world, and is supported by the basic platform. It communicates with the outside world through sensors; computing power is provided by smart chips (CPU, NPU, GPU, ASIC, FPGA and other hardware acceleration chips); the basic platform includes distributed computing frameworks and networks and other related platform guarantees and support, which can include cloud storage and computing, interconnected networks, etc. For example, sensors communicate with the outside world to obtain data, and these data are provided to the smart chips in the distributed computing system provided by the basic platform for calculation.
- the data on the upper layer of the infrastructure is used to represent the data sources in the field of artificial intelligence.
- the data involves graphics, images, voice, text, and IoT data of traditional devices, including business data of existing systems and perception data such as force, displacement, liquid level, temperature, and humidity.
- Data processing usually includes data training, machine learning, deep learning, search, reasoning, decision-making and other methods.
- machine learning and deep learning can symbolize and formalize data for intelligent information modeling, extraction, preprocessing, and training.
- Reasoning refers to the process of simulating human intelligent reasoning in computers or intelligent systems, using formalized information to perform machine thinking and solve problems based on reasoning control strategies. Typical functions are search and matching.
- Decision-making refers to the process of making decisions after intelligent information is reasoned, usually providing functions such as classification, sorting, and prediction.
- some general capabilities can be further formed based on the results of the data processing, such as an algorithm or a general system, for example, translation, text analysis, computer vision processing, speech recognition, image recognition, etc.
- Smart products and industry applications refer to the products and applications of artificial intelligence systems in various fields. They are the encapsulation of the overall artificial intelligence solution, which productizes intelligent information decision-making and realizes practical applications. Its application areas mainly include: smart terminals, smart transportation, smart medical care, autonomous driving, smart cities, etc.
- the present application can be applied to the field of image processing in the field of artificial intelligence. Taking image processing as an example, multiple application scenarios of products will be introduced below.
- image processing can be used as a core algorithm module of the robot's visual language navigation system.
- a user wants to command a household robot to walk to a chair and pick up a vase through language instructions.
- the robot needs to accurately segment or detect the target instances of the chair and the vase described in the language before it can complete the task of picking up the vase.
- image processing functions can be applied to autonomous driving platforms.
- the recognition module of the intelligent driving system needs to first understand the user's natural language instructions and accurately segment or detect the yellow car to complete the user's request.
- image processing functions can be applied to interactive image editing systems, which need to modify images based on the user's natural language description.
- the image processing function can locate the area that the user wants to modify, and then combine it with existing image editing tools to modify the image content.
- the following introduces applications with image processing functions (hereinafter referred to as image processing applications) and cloud services provided by cloud-side servers:
- the product form of the embodiment of the present application can be an image processing application, in particular, a language-driven image processing application.
- the language-driven image processing application can be run on a terminal device or a server on the cloud side.
- a language-driven image processing application can perform tasks such as image segmentation or target detection based on the input image and text to obtain processing results.
- the processing results can be image segmentation results (mask area) and detection boxes.
- the image segmentation results (mask area) and the detection box can contain objects indicated by the semantics of the text (such as the target object in the embodiment of the present application).
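The two result forms are related: a detection box can be read off a mask as the tightest axis-aligned rectangle around its nonzero pixels. The application does not say that boxes are derived this way, so the sketch below is only an illustration, and `mask_to_box` is a hypothetical helper.

```python
def mask_to_box(mask):
    """Return (x_min, y_min, x_max, y_max) of the nonzero pixels, or None if empty."""
    coords = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    if not coords:
        return None
    xs, ys = zip(*coords)
    return min(xs), min(ys), max(xs), max(ys)


mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 0],
]
box = mask_to_box(mask)
```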
- a user can open an image processing application installed on a terminal device and input images and text.
- the image processing application can process the image and text using the method provided in an embodiment of the present application and present the processing results to the user (the presentation method may be but is not limited to display, saving, uploading to the cloud, etc.).
- a user can open an image processing application installed on a terminal device and input an image and text.
- the image processing application can send the image and text to a server on the cloud side.
- the server on the cloud side processes the image and text using the method provided in an embodiment of the present application and transmits the processing result back to the terminal device.
- the terminal device can present the processing result to the user (the presentation method may be but is not limited to display, saving, uploading to the cloud side, etc.).
- FIG. 1B is a schematic diagram of the functional architecture of an image processing application in an embodiment of the present application:
- an image processing application 102 may receive input parameters 101 (e.g., including an image) and generate a processing result 103.
- the image processing application 102 may be executed on (for example) at least one computer system and includes computer codes, which, when executed by one or more computers, cause the computers to execute the method provided in the embodiments of the present application.
- FIG. 1C is a schematic diagram of the physical architecture for running an image processing application in an embodiment of the present application:
- FIG. 1C shows a schematic diagram of a system architecture.
- the system may include a terminal 100 and a server 200.
- the server 200 may include one or more servers (FIG. 1C is illustrated by taking one server as an example), and the server 200 may provide the method provided in the embodiment of the present application for one or more terminals.
- an image processing application can be installed on the terminal 100.
- the above application and web page can provide an interface.
- the terminal 100 can receive relevant parameters entered by the user on the language-driven image processing interface and send the above parameters to the server 200.
- the server 200 can obtain the processing results based on the received parameters and return the processing results to the terminal 100.
- the terminal 100 can also complete the action of obtaining the processing result based on the received parameters by itself without the cooperation of the server, and the embodiments of the present application are not limited to this.
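The two deployment options (the terminal forwards the parameters to the server, or the terminal computes the result itself) can be sketched as a simple dispatch. Every name here (`InputParameters`, `ProcessingResult`, `run_on_terminal`, `dummy_model`) is hypothetical, and the "server" is simulated by a local function rather than a real network call.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class InputParameters:
    image: list   # placeholder for pixel data
    text: str


@dataclass
class ProcessingResult:
    mask: list
    source: str   # "server" or "terminal"


def run_on_terminal(params: InputParameters,
                    local_model: Callable[[InputParameters], list],
                    send_to_server: Optional[Callable[[InputParameters], list]] = None
                    ) -> ProcessingResult:
    """Use the server when a transport is supplied; otherwise process locally."""
    if send_to_server is not None:
        return ProcessingResult(mask=send_to_server(params), source="server")
    return ProcessingResult(mask=local_model(params), source="terminal")


def dummy_model(params):
    # Stand-in for the actual method: marks every pixel as background.
    return [[0 for _ in row] for row in params.image]


params = InputParameters(image=[[10, 20], [30, 40]], text="the yellow car")
local_result = run_on_terminal(params, dummy_model)
remote_result = run_on_terminal(params, dummy_model, send_to_server=dummy_model)
```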
- the terminal 100 in the embodiment of the present application can be a mobile phone, a tablet computer, a wearable device, a vehicle-mounted device, an augmented reality (AR)/virtual reality (VR) device, a laptop computer, an ultra-mobile personal computer (UMPC), a netbook, a personal digital assistant (PDA), etc., and the embodiment of the present application does not impose any limitation on this.
- FIG. 1D shows a schematic diagram of an optional hardware structure of the terminal 100 .
- the terminal 100 may include components such as a radio frequency unit 110, a memory 120, an input unit 130, a display unit 140, a camera 150 (optional), an audio circuit 160 (optional), a speaker 161 (optional), a microphone 162 (optional), a processor 170, an external interface 180, and a power supply 190.
- FIG. 1D is merely an example of a terminal or a multi-function device and does not constitute a limitation on it; the terminal may include more or fewer components than shown in the figure, combine certain components, or use different components.
- the input unit 130 can be used to receive input digital or character information, and generate key signal input related to the user settings and function control of the portable multifunctional device.
- the input unit 130 may include a touch screen 131 (optional) and/or other input devices 132.
- the touch screen 131 can collect user touch operations on or near it (such as operations performed by the user using fingers, joints, stylus, or any other suitable object on or near the touch screen), and drive the corresponding connection device according to a pre-set program.
- the touch screen can detect the user's touch action on it, convert the touch action into a touch signal and send the signal to the processor 170, and receive and execute commands sent by the processor 170; the touch signal includes at least the coordinate information of the touch point.
- the touch screen 131 can provide an input interface and an output interface between the terminal 100 and the user.
- the touch screen can be implemented in various types such as resistive, capacitive, infrared and surface acoustic wave.
- the input unit 130 can also include other input devices.
- other input devices 132 can include but are not limited to one or more of a physical keyboard, function keys (such as volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, etc.
- the input device 132 may receive input images, texts, and the like.
- the display unit 140 may be used to display information input by the user or provided to the user, various menus of the terminal 100, interactive interfaces, file display, and/or playback of any multimedia file.
- the display unit 140 may be used to display the interface of an image processing application, processing results, etc.
- the memory 120 can be used to store instructions and data.
- the memory 120 can mainly include an instruction storage area and a data storage area.
- the data storage area can store various data, such as multimedia files, texts, etc.;
- the instruction storage area can store software units such as operating systems, applications, and instructions required for at least one function, or their subsets and extensions. The memory 120 can also include non-volatile random access memory, and can provide the processor 170 with hardware, software, and data resources for managing the computing and processing device, supporting control software and applications. It is also used for storing multimedia files as well as running programs and applications.
- the processor 170 is the control center of the terminal 100. It uses various interfaces and lines to connect various parts of the entire terminal 100. By running or executing instructions stored in the memory 120 and calling data stored in the memory 120, it executes various functions of the terminal 100 and processes data, thereby controlling the terminal device as a whole.
- the processor 170 may include one or more processing units; preferably, the processor 170 may integrate an application processor and a modem processor, wherein the application processor mainly processes the operating system, user interface and application program, and the modem processor mainly processes wireless communication. It is understandable that the above-mentioned modem processor may not be integrated into the processor 170.
- the processor and the memory may be implemented on a single chip, and in some embodiments, they may also be implemented separately on separate chips.
- the processor 170 may also be used to generate corresponding operation control signals, send them to corresponding components of the computing and processing device, read and process data in the software, especially read and process data and programs in the memory 120, so that each functional module therein performs corresponding functions, thereby controlling the corresponding components to act according to the requirements of the instructions.
- the memory 120 can be used to store software codes related to the data processing method.
- the processor 170 can execute the steps of the chip data processing method, and can also schedule other units (such as the above-mentioned input unit 130 and display unit 140) to realize corresponding functions.
- the RF unit 110 (optional) can be used to receive and send information, or to receive and send signals during a call; for example, after receiving downlink information from the base station, it passes the information to the processor 170 for processing, and it sends uplink data to the base station.
- the RF circuit includes but is not limited to an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier (LNA), a duplexer, etc.
- the RF unit 110 can also communicate with network devices and other devices through wireless communication.
- the wireless communication can use any communication standard or protocol, including but not limited to Global System of Mobile communication (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), email, Short Messaging Service (SMS), etc.
- the RF unit 110 can send the image to the server 200 and receive the processing result sent by the server 200.
- radio frequency unit 110 is optional and can be replaced by other communication interfaces, such as a network port.
- the terminal 100 also includes a power supply 190 (such as a battery) for supplying power to various components.
- the power supply can be logically connected to the processor 170 through a power management system, so that the power management system can manage functions such as charging, discharging, and power consumption.
- the terminal 100 also includes an external interface 180, which can be a standard Micro USB interface or a multi-pin connector. It can be used to connect the terminal 100 to communicate with other devices, and can also be used to connect a charger to charge the terminal 100.
- the terminal 100 may also include a flash, a wireless fidelity (WiFi) module, a Bluetooth module, sensors with different functions, etc., which are not described in detail here. Some or all of the methods described below may be applied to the terminal 100 as shown in FIG. 1D .
- FIG. 2 provides a schematic diagram of the structure of a server 200.
- the server 200 includes a bus 201, a processor 202, a communication interface 203, and a memory 204.
- the processor 202, the memory 204, and the communication interface 203 communicate with each other via the bus 201.
- the bus 201 may be a peripheral component interconnect (PCI) bus or an extended industry standard architecture (EISA) bus.
- the bus can be divided into an address bus, a data bus, a control bus, etc. For ease of representation, only one thick line is used in FIG. 2, but this does not mean that there is only one bus or one type of bus.
- the processor 202 may be any one or more of a central processing unit (CPU), a graphics processing unit (GPU), a microprocessor (MP), or a digital signal processor (DSP).
- the memory 204 may include a volatile memory, such as a random access memory (RAM). The memory 204 may also include a non-volatile memory, such as a read-only memory (ROM), a flash memory, a hard disk drive (HDD), or a solid state drive (SSD).
- the memory 204 may be used to store software codes related to the data processing method, and the processor 202 may execute the steps of the data processing method of the chip, and may also schedule other units to implement corresponding functions.
- the above-mentioned terminal 100 and server 200 can be centralized or distributed devices, and the processors in the above-mentioned terminal 100 and server 200 (such as processor 170 and processor 202) can be hardware circuits (such as application specific integrated circuit (ASIC), field-programmable gate array (FPGA), general-purpose processor, digital signal processor (DSP), microprocessor or microcontroller, etc.), or a combination of these hardware circuits.
- the processor can be a hardware system with an instruction execution function, such as a CPU, DSP, etc., or a hardware system without an instruction execution function, such as an ASIC, FPGA, etc., or a combination of the above-mentioned hardware systems without an instruction execution function and hardware systems with an instruction execution function.
- the steps related to the model reasoning process in the embodiments of the present application involve AI-related operations.
- the instruction execution architecture of the terminal device and the server is not limited to the processor combined with the memory architecture described above.
- the system architecture provided in the embodiments of the present application is described in detail below in conjunction with FIG. 3.
- FIG. 3 is a schematic diagram of a system architecture provided by an embodiment of the present application.
- a system architecture 500 includes an execution device 510, a training device 520, a database 530, a client device 540, a data storage system 550, and a data acquisition device 560.
- the execution device 510 includes a calculation module 511, an I/O interface 512, a preprocessing module 513 and a preprocessing module 514.
- the calculation module 511 may include a target model/rule 501, and the preprocessing module 513 and the preprocessing module 514 are optional.
- the execution device 510 may be a terminal device or a server that runs the above-mentioned image processing application.
- the data acquisition device 560 is used to acquire training samples.
- the training samples may be multiple images, etc.
- the data acquisition device 560 stores the training samples in the database 530 .
- the training device 520 can train the neural network to be trained (such as the cross-modal language model in the embodiment of the present application (such as a text encoder, an image encoder, a target encoder, etc.)) based on the training samples maintained in the database 530 to obtain the target model/rule 501.
- the training device 520 can perform a pre-training process on the neural network to be trained based on the training samples maintained in the database 530, or fine-tune the model based on the pre-training.
- the training samples maintained in the database 530 may not all come from the data acquisition device 560, but may also be received from other devices. It should also be noted that the training device 520 may not train the target model/rule 501 entirely based on the training samples maintained in the database 530, but may also obtain training samples from the cloud or other places for model training. The above description should not be used as a limitation on the embodiments of the present application.
- the target model/rule 501 trained by the training device 520 can be applied to different systems or devices, such as the execution device 510 shown in FIG. 3.
- the execution device 510 can be a terminal, such as a mobile phone terminal, a tablet computer, a laptop computer, an augmented reality (AR)/virtual reality (VR) device, a vehicle terminal, etc., and can also be a server, etc.
- the training device 520 may transfer the trained model to the execution device 510 .
- the execution device 510 is configured with an input/output (I/O) interface 512 for data interaction with an external device.
- the user can input data (such as images in the embodiments of the present application) to the I/O interface 512 through the client device 540.
- the preprocessing module 513 and the preprocessing module 514 are used to preprocess the input data received by the I/O interface 512. It should be understood that both preprocessing modules may be absent, or only one of them may be present. When neither the preprocessing module 513 nor the preprocessing module 514 is present, the calculation module 511 may process the input data directly.
- when the execution device 510 preprocesses the input data, or when the calculation module 511 of the execution device 510 performs calculations and other related processing, the execution device 510 can call the data, code, etc. in the data storage system 550 for the corresponding processing, and can also store the data, instructions, etc. obtained from that processing in the data storage system 550.
- the I/O interface 512 provides the processing results to the client device 540 and thus to the user.
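- the data flow through the execution device 510 described above (input received at the I/O interface, optional preprocessing, then the calculation module) can be sketched as follows. This is only a minimal illustration: the class name `ExecutionDevice`, the method `handle`, and the toy `normalize` and `brightest` functions are hypothetical and do not come from the present application.

```python
from typing import Any, Callable, List, Optional


class ExecutionDevice:
    """Hypothetical stand-in for the execution device 510."""

    def __init__(self,
                 model: Callable[[Any], Any],
                 preprocessors: Optional[List[Callable[[Any], Any]]] = None):
        # `model` plays the role of the target model/rule 501.
        self.model = model
        # Zero, one, or two preprocessing modules may be configured.
        self.preprocessors = preprocessors or []

    def handle(self, input_data: Any) -> Any:
        """Entry point playing the role of the I/O interface 512."""
        data = input_data
        for step in self.preprocessors:
            data = step(data)
        # With no preprocessing modules, the model sees the raw input.
        return self.model(data)


def normalize(pixels):
    """Toy preprocessing: scale 8-bit values into [0, 1]."""
    return [p / 255 for p in pixels]


def brightest(pixels):
    """Toy 'model': return the largest value."""
    return max(pixels)


with_pre = ExecutionDevice(brightest, [normalize])
without_pre = ExecutionDevice(brightest)
```

- in this sketch, omitting the preprocessing list corresponds to the case where the calculation module processes the input data directly.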
- the user can manually specify input data through an interface provided by the I/O interface 512.
- the client device 540 can automatically send input data to the I/O interface 512. If automatically sending input data requires the user's authorization, the user can set the corresponding permission in the client device 540. The user can view the results output by the execution device 510 on the client device 540, and the results can be presented in a specific form such as display, sound, or action.
- the client device 540 can also be used as a data acquisition terminal to collect the input data fed into the I/O interface 512 and the output results returned by the I/O interface 512, as shown in the figure, as new sample data, and store them in the database 530.
- alternatively, the I/O interface 512 can directly store the input data it receives and the output results it returns in the database 530 as new sample data.
- FIG. 3 is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the positional relationship between the devices, components, modules, etc. shown in the figure does not constitute any limitation.
- the data storage system 550 is an external memory relative to the execution device 510. In other cases, the data storage system 550 can also be placed in the execution device 510. It should be understood that the above-mentioned execution device 510 can be deployed in the client device 540.
- the calculation module 511 of the above-mentioned execution device 510 can obtain the code stored in the data storage system 550 to implement the steps related to the model reasoning process in the embodiment of the present application.
- the calculation module 511 of the execution device 510 may include a hardware circuit (such as an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a general-purpose processor, a digital signal processor (DSP), a microprocessor or a microcontroller, etc.), or a combination of these hardware circuits.
- the calculation module 511 of the execution device 510 may be a hardware system with an instruction execution function, such as a CPU, a DSP, etc., or a hardware system without an instruction execution function, such as an ASIC, an FPGA, etc., or a combination of the above-mentioned hardware systems without an instruction execution function and hardware systems with an instruction execution function.
- the calculation module 511 of the execution device 510 can be a hardware system with an instruction execution function, and the steps related to the model reasoning process provided in the embodiment of the present application can be software codes stored in the memory.
- the calculation module 511 of the execution device 510 can obtain the software code from the memory and execute the obtained software code to implement the steps related to the model reasoning process provided in the embodiment of the present application.
- the calculation module 511 of the execution device 510 can be a combination of a hardware system without an instruction execution function and a hardware system with an instruction execution function. Some of the steps related to the model reasoning process provided in the embodiments of the present application can also be implemented by the hardware system without an instruction execution function in the calculation module 511 of the execution device 510, which is not limited here.
- the above-mentioned training device 520 can obtain the code stored in the memory (not shown in Figure 3, which can be integrated into the training device 520 or deployed separately from the training device 520) to implement the steps related to model training in an embodiment of the present application.
- the training device 520 may include a hardware circuit (such as an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a general-purpose processor, a digital signal processor (DSP), a microprocessor or a microcontroller, etc.), or a combination of these hardware circuits.
- the training device 520 may be a hardware system with an instruction execution function, such as a CPU, DSP, etc., or a hardware system without an instruction execution function, such as an ASIC, FPGA, etc., or a combination of the above-mentioned hardware systems without an instruction execution function and hardware systems with an instruction execution function.
- the training device 520 can be a combination of a hardware system that does not have the function of executing instructions and a hardware system that has the function of executing instructions. Some of the steps related to the model training provided in the embodiments of the present application can also be implemented by the hardware system that does not have the function of executing instructions in the training device 520, which is not limited here.
- the server can provide language-driven image processing services to the end side through an application programming interface (API).
- the terminal device can send relevant parameters (such as images) to the server through the API provided by the cloud, and the server can obtain the processing results based on the received parameters and return the processing results to the terminal.
- FIG. 4 shows a process of using a language-driven image processing cloud service provided by a cloud platform.
- the cloud platform provides multiple development versions of the software development kit (SDK) for users to choose from according to the requirements of the development environment, such as a Java version SDK, a Python version SDK, a PHP version SDK, an Android version SDK, etc.
- the SDK project is imported into the local development environment, and configuration and debugging are performed in the local development environment.
- the local development environment can also be used to develop other functions, thus forming an application that integrates language-driven image processing capabilities.
- the language-driven image processing API call can be triggered.
- when the application triggers the language-driven image processing function, an API request is initiated to the running instance of the language-driven image processing service in the cloud environment, wherein the API request carries an image, and the running instance in the cloud environment processes the image to obtain the processing result.
- the cloud environment returns the processing results to the application, thereby completing a method call provided in an embodiment of the present application.
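- the request and response exchange described above can be illustrated with a small serialization sketch. It makes several assumptions that are not stated in the present application: the request body is JSON, the image is base64-encoded, a `text` field carries the language description, and the function names `build_request` and `parse_request` are hypothetical. No real cloud SDK or network call is involved.

```python
import base64
import json


def build_request(image_bytes: bytes, text: str) -> str:
    """Client side: pack an image and its text description into a JSON body.

    The field names "image" and "text" are hypothetical.
    """
    payload = {
        "image": base64.b64encode(image_bytes).decode("ascii"),
        "text": text,
    }
    return json.dumps(payload)


def parse_request(body: str):
    """Server side: recover the raw image bytes and the text description."""
    payload = json.loads(body)
    return base64.b64decode(payload["image"]), payload["text"]


# Round trip with placeholder bytes standing in for a real image file.
body = build_request(b"\x89PNG...", "the vase on the left")
image, text = parse_request(body)
```

- base64 is used here only because JSON cannot carry raw bytes; an actual service may use a different encoding or a multipart upload.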
- a neural network may be composed of neural units. A neural unit may refer to an operation unit that takes $x_s$ (i.e., input data) and an intercept of 1 as input, and the output of the operation unit may be:
$$f\left(\sum_{s=1}^{n} W_s x_s + b\right)$$
- where $s = 1, 2, \ldots, n$, and n is a natural number greater than 1
- $W_s$ is the weight of $x_s$
- b is the bias of the neural unit.
- f is the activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network to convert the input signal in the neural unit into the output signal.
- the output signal of the activation function can be used as the input of the next convolutional layer, and the activation function can be a sigmoid function.
- a neural network is a network formed by connecting multiple single neural units mentioned above, that is, the output of one neural unit can be the input of another neural unit.
- the input of each neural unit can be connected to the local receptive field of the previous layer to extract the characteristics of the local receptive field.
- the local receptive field can be an area composed of several neural units.
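- a single neural unit as described above (a weighted sum of the inputs plus a bias, passed through an activation function) can be sketched as follows. The function names are illustrative, and the sigmoid is used because the text names it as one possible activation function.

```python
import math
from typing import Sequence


def sigmoid(z: float) -> float:
    """Sigmoid activation, one possible choice for f."""
    return 1.0 / (1.0 + math.exp(-z))


def neural_unit(xs: Sequence[float], ws: Sequence[float], b: float) -> float:
    """Compute f(sum_s W_s * x_s + b) with f = sigmoid."""
    z = sum(w * x for w, x in zip(ws, xs)) + b
    return sigmoid(z)


# The weighted sum here is 0.5*1.0 + (-0.25)*2.0 + 0.0 = 0.0.
out_zero = neural_unit([1.0, 2.0], [0.5, -0.25], 0.0)

# The output of one unit can serve as the input of another unit.
out_chained = neural_unit([out_zero], [2.0], -1.0)
```

- the second call shows the chaining mentioned in the text: the first unit's output becomes the second unit's input.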
- FIG. 5 is a schematic diagram of the architecture of a transformer layer.
- the neural network includes an embedding layer and at least one transformer layer, and the at least one transformer layer can be N transformer layers (N is an integer greater than 0), wherein each transformer layer includes an attention layer, an add&norm layer, a feed forward layer, and an add&norm layer that are adjacent in sequence.
- the current input is embedded to obtain multiple embedding vectors;
- P input vectors are obtained from the previous layer of the first transformer layer, and the first input vector among the P input vectors is taken as the center, and the intermediate vector corresponding to the first input vector is obtained based on the correlation between each input vector within the preset attention window range and the first input vector, so as to determine the P intermediate vectors corresponding to the P input vectors;
- the P intermediate vectors are merged into Q output vectors, wherein the multiple output vectors obtained by the last transformer layer in the transformer layer are used as the feature representation of the current input.
- the attention mechanism imitates the internal process of biological observation behavior, that is, a mechanism that aligns internal experience and external sensations to increase the observation precision of some areas, and can use limited attention resources to quickly filter out high-value information from a large amount of information.
- the attention mechanism can quickly extract important features of sparse data, and is therefore widely used in natural language processing tasks, especially machine translation.
- the self-attention mechanism is an improvement on the attention mechanism, which reduces dependence on external information and is better at capturing the internal correlation of data or features.
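- the essential idea of the attention mechanism can be rewritten as the following formula:
$$\mathrm{Attention}(\mathrm{Query}, \mathrm{Source}) = \sum_{i=1}^{L_x} \mathrm{Similarity}(\mathrm{Query}, \mathrm{Key}_i) \cdot \mathrm{Value}_i$$
- where $L_x$ represents the length of Source.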
- the formula means that the elements in Source are imagined as being composed of a series of (Key, Value) data pairs. Given an element Query in the Target, the weight coefficient of the Value corresponding to each Key is obtained by calculating the similarity or correlation between the Query and that Key, and the Values are then weighted and summed to obtain the final Attention value. In essence, the Attention mechanism performs a weighted summation of the Values of the elements in the Source, and the Query and Key are used to calculate the weight coefficient of the corresponding Value.
- Attention can be understood as selectively screening out a small amount of important information from a large amount of information and focusing on these important information, ignoring most of the unimportant information.
- the process of focusing is reflected in the calculation of the weight coefficient.
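- the self-attention mechanism can be understood as internal Attention (intra attention).
- the ordinary Attention mechanism occurs between the Query element of the Target and all the elements in the Source, whereas self-attention occurs among the elements within the Source or within the Target, which corresponds to the special case in which Target equals Source.
- the specific calculation process is the same; only the calculation object has changed.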
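- the weighted summation described above can be sketched as follows. The text only requires "similarity or correlation" between the Query and each Key; this sketch assumes a dot product followed by softmax normalization, which is one common choice. The function names are illustrative.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Normalize scores into weight coefficients that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    """Weighted sum of Values; weights come from Query-Key similarity."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def self_attention(items):
    """Special case Target = Source: every element attends to all elements."""
    return [attention(x, items, items) for x in items]


# Two identical keys give equal weights, so the result is the mean Value.
keys = [[1.0, 0.0], [1.0, 0.0]]
values = [[2.0, 0.0], [4.0, 0.0]]
result = attention([1.0, 0.0], keys, values)

self_result = self_attention([[1.0, 0.0], [0.0, 1.0]])
```

- `self_attention` reuses the same calculation and only changes the calculation object, matching the description of self-attention in the text.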
- NLP Natural language processing
- Natural language is human language, and natural language processing (NLP) is the processing of human language. Natural language processing is the process of systematically analyzing, understanding, and extracting information from text data in an intelligent and efficient way.
- typical natural language processing tasks include machine translation (MT), named entity recognition (NER), relation extraction (RE), information extraction (IE), sentiment analysis, speech recognition, question answering, and topic segmentation, etc.
- the pre-trained language model is a natural language sequence encoder that encodes each word in a natural language sequence into a vector representation for prediction tasks. Its training consists of two stages. In the pre-training stage, the model is trained on language model tasks on large-scale unsupervised text to learn a word representation. In the fine-tuning stage, the model is initialized using the parameters learned in the pre-training stage, and is trained on downstream tasks such as text classification and sequence labeling with fewer steps, so that the semantic information obtained from pre-training can be successfully transferred to downstream tasks.
- Convolutional neural networks can use the error back propagation (BP) algorithm to correct the size of the parameters in the initial super-resolution model during the training process, so that the reconstruction error loss of the super-resolution model becomes smaller and smaller.
- forward propagation of the input signal to the output produces an error loss, and the error loss information is back-propagated to update the parameters in the initial super-resolution model, so that the error loss converges.
- the back propagation algorithm is a backward pass driven by the error loss, which aims to obtain the optimal parameters of the super-resolution model, such as the weight matrix.
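- the idea of propagating the error loss backward to update parameters until the loss converges can be illustrated on a deliberately tiny model. The sketch below is not the super-resolution model mentioned in the text: it assumes a one-parameter linear model `y = w * x`, a mean squared error loss, and plain gradient descent, all chosen only for illustration.

```python
def train(samples, w=0.0, lr=0.1, steps=50):
    """Fit y = w * x by gradient descent on the mean squared error."""
    losses = []
    n = len(samples)
    for _ in range(steps):
        # Forward pass: compute the error loss for the current parameter.
        loss = sum((w * x - y) ** 2 for x, y in samples) / n
        losses.append(loss)
        # Backward pass: gradient of the loss with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in samples) / n
        # Parameter update in the direction that reduces the loss.
        w -= lr * grad
    return w, losses


# The data follow y = 2 * x, so the weight should approach 2.
w_final, losses = train([(1.0, 2.0), (2.0, 4.0)])
```

- in a multi-layer network the same gradient is obtained layer by layer through the chain rule, which is what back propagation computes.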
- Language-driven precise instance segmentation is a special semantic segmentation technology. It refers to accurately segmenting the instance targets described by language in an image according to the guidance of natural language. Its characteristics are: 1) Traditional semantic segmentation models predict the same label for all targets belonging to the same category and do not distinguish different targets in the same category, while language-driven precise instance segmentation needs to accurately identify an instance target corresponding to the language description from multiple similar targets; 2) The semantic segmentation model needs to pre-define a set of semantic category labels in order to learn to segment targets of these categories, while language-driven precise instance segmentation can accept more flexible natural language input and is not limited to target categories.
- the present application provides a data processing method.
- the data processing method of the present application is described in detail below with reference to the accompanying drawings.
- FIG. 6 is a flow chart of a data processing method provided in an embodiment of the present application.
- a data processing method provided in an embodiment of the present application may include steps 601 to 603, and these steps are described in detail below.
- the semantics of the text may indicate determining a mask region corresponding to a target object in an image (image segmentation task) or a detection box (object detection task).
- the semantics of the text is used to describe the features of the target object. For example, if the image includes two vases, one red and one yellow, the text may be "the red vase". As another example, if the image includes two vases, one on the left and one on the right of the image, the text may be "the vase on the left".
- feature extraction and alignment may be performed on the image and the text to obtain a first image feature corresponding to the image and a text feature corresponding to the text.
- obtaining the first image feature corresponding to the image and the text feature corresponding to the text specifically includes: processing the image through an image encoder to obtain a third image feature corresponding to the image, processing the text through a text encoder to obtain a first text feature corresponding to the text, and fusing the third image feature and the first text feature through a bidirectional attention mechanism to obtain the first image feature corresponding to the image and the text feature corresponding to the text.
- the image encoder $f_v$ (or visual encoder) can use Swin Transformer to extract multi-level visual features for a given visual image I; the multi-scale visual features generated by the multiple stages of Swin Transformer (taking four stages as an example) are denoted $V_i$, $i \in \{1, \ldots, 4\}$.
- the text encoder $f_l$ can use a multi-layer BERT (taking 12 layers as an example) as a language encoder.
- the bidirectional attention mechanism can be implemented using the Word-pixel alignment module.
- the Word-pixel alignment module is a cross-modal thresholded bidirectional attention module that aligns visual and language features in the feature space during the encoding phase of images and sentences.
- the learnable feature threshold mechanism is to prevent the original feature information from being overwhelmed when updating the fused features.
- the alignment effect of the Word-pixel alignment can be shown in Figure 8.
- the language information is integrated into the visual encoding during the visual and language information encoding phase, and the visual information is integrated into the language encoding, so that the word features of the sentence and the corresponding pixel features in the picture are correlated in the cross-modal feature space.
- the cross-modal bidirectional attention module BiAttn interacts visual and language features in the feature space.
- This module is used to fuse the visual and language features of each stage of the dual encoder.
- its operation is defined as follows:
- $V'_i, L'_i = \mathrm{BiAttn}(V_i, L_i),\ i \in \{1, \ldots, 4\}$;
- $d_k$ is the dimension of the joint visual language embedding space
- $W_v$, $W_l$, $W'_v$, $W'_l$ are all projection matrices.
- $F_i$ represents the fused features from the BCA module
- $F'_i$ represents the suppressed fused features
- $\odot$ represents element-wise matrix multiplication.
- MLP is a two-layer perceptron, the first layer is a linear layer, followed by a ReLU activation function, and the second layer is a linear layer, followed by a hyperbolic tangent activation function.
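- the bidirectional attention and the thresholding step can be sketched as follows. The exact formulas are not reproduced in this text, so the sketch rests on stated assumptions: scaled dot-product attention is computed once between pixels and words and used in both directions, the projection matrices are omitted (treated as identity), the suppressed feature is taken as the fused feature multiplied element-wise by the MLP output, and the MLP biases are omitted. The function names `bi_attn` and `gate` are hypothetical.

```python
import math


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(m):
    return [list(r) for r in zip(*m)]


def softmax_rows(m):
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(v - mx) for v in row]
        s = sum(exps)
        out.append([e / s for e in exps])
    return out


def bi_attn(v, l):
    """v: pixels x d_k, l: words x d_k (projections assumed identity)."""
    d_k = len(v[0])
    attn = [[s / math.sqrt(d_k) for s in row]
            for row in matmul(v, transpose(l))]
    v_new = matmul(softmax_rows(attn), l)             # language into vision
    l_new = matmul(softmax_rows(transpose(attn)), v)  # vision into language
    return v_new, l_new


def gate(f, w1, w2):
    """Assumed form: F' = F * tanh(Linear2(ReLU(Linear1(F)))), element-wise."""
    hidden = [[max(0.0, h) for h in row] for row in matmul(f, w1)]
    g = [[math.tanh(o) for o in row] for row in matmul(hidden, w2)]
    return [[fi * gi for fi, gi in zip(fr, gr)] for fr, gr in zip(f, g)]


V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 pixel features
L = [[1.0, 0.0], [0.0, 1.0]]              # 2 word features
V_new, L_new = bi_attn(V, L)

identity = [[1.0, 0.0], [0.0, 1.0]]       # toy MLP weights
V_gated = gate(V_new, identity, identity)
```

- because the hyperbolic tangent keeps the gate values within (-1, 1), the gated features never exceed the fused features in magnitude, which is one way to keep the original feature information from being overwhelmed.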
- a multi-head attention layer can be used to fuse high-level features from the visual language encoder.
- the high-level visual features $V_o$ and language features $L_o$ are projected into the same feature space and then concatenated into a fused feature $F_o$, which is sent to the cross-attention layer.
- a learnable position vector $e_p$ is added to the projected visual features.
- the cross-attention layer outputs the feature $S_o$.
- in order to upsample the pixel-level features to the original image size and obtain the final segmentation map, a segmentation head can be constructed.
- the input of the segmentation head can be $S_o$ and the multi-scale visual features; the output is then computed using the following operators:
- a two-layer convolutional network, in which each layer is a 3×3 convolution followed by ReLU and batch normalization
- Up, which represents bilinear-interpolation upsampling
- a 1×1 convolution
- feature projection is performed on each pixel of the segmentation head, and the output of the segmentation head is recorded as $Y_1$.
- each second embedding vector corresponds to an object in the image; each second embedding vector is fused with the first image feature to obtain a corresponding second image feature.
- the neural network may process the plurality of first embedding vectors into a plurality of second embedding vectors according to the first image feature, wherein each second embedding vector may correspond to a candidate region of the target object, and different second embedding vectors may correspond to different or overlapping candidate regions of the target object.
- This is equivalent to changing the image features from pixel granularity to target object granularity, that is, introducing target integrity constraints in cross-modal feature fusion, fusing pixels belonging to the same target as a whole with language encoding, and activating instance areas based on targets.
- This can effectively solve the problems of inaccurate target positioning and mask prediction or detection box prediction in existing language-driven precise instance segmentation methods, thereby improving segmentation accuracy.
- the neural network includes multiple transformer layers.
- the Sentence-object alignment module first generates possible target masks based on the word-pixel aligned features, and then aligns the natural language sentence features and the target masks to more accurately locate the target instance, as shown in Figure 9.
- the embodiment of the present application designs a mask generator MaskGenerator, which predicts N possible target masks based on the output So of the encoder.
- a weight can be assigned to each second embedding vector according to the text features; a second embedding vector with a higher weight is more likely to contain the target object referred to by the text.
- a weight Q_w can be assigned to each mask vector (second embedding vector) of Q_o according to the text feature L_g.
- a higher weight in Q_w indicates that the corresponding mask is more likely to contain the object referred to by the language.
- N mask predictions Y_N are obtained by multiplying Q_o and Y_1.
- Y_N and Q_w are multiplied to obtain the final mask prediction M.
- sim(·,·) represents the cosine similarity function
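The sentence-object alignment steps above (weights from cosine similarity, N mask predictions, weighted combination) can be sketched in plain Python. This is a minimal illustration, not the implementation of the application: the function names, the flattened pixel layout of Y_1 (P pixels, each with C channels), and the softmax over the N mask vectors are assumptions made for the example.

```python
import math

def cosine(u, v):
    # Cosine similarity between two vectors of equal length.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-12)

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sentence_object_alignment(L_g, Q_o, Y_1):
    """L_g: sentence feature (length C).
    Q_o: N mask vectors (N x C).
    Y_1: per-pixel features from the segmentation head (P x C, assumed layout)."""
    # Q_w: one weight per mask vector, from its similarity to the sentence.
    Q_w = softmax([cosine(L_g, q) for q in Q_o])
    # Y_N: N mask predictions, each a dot product of a mask vector with every pixel.
    Y_N = [[sum(qc * pc for qc, pc in zip(q, pix)) for pix in Y_1] for q in Q_o]
    # M: weighted sum of the N mask predictions.
    M = [sum(w * y[p] for w, y in zip(Q_w, Y_N)) for p in range(len(Y_1))]
    return Q_w, Y_N, M
```

In this sketch, a mask vector that points in the same direction as the sentence feature receives the largest weight, so its mask prediction dominates M.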
- FIG. 7 is a schematic diagram of a network architecture of an embodiment of the present application, wherein the overall architecture design follows the classic encoder-decoder paradigm.
- the encoder part consists of a visual encoder and a language encoder to extract visual and language features, and the Word-Pixel Alignment (WPA) module is in the middle layer of the visual and language encoding to achieve cross-modal interaction.
- a cross-attention layer is used to cross-modally fuse the outputs of the visual and language encoders.
- the decoder part consists of a mask generator that generates N mask query vectors, a segmentation head that upsamples pixel features, and a Sentence-Object Alignment module (SOA), which weights the output mask query vector according to the sentence features, and uses the weights to perform weighted summation on the segmentation features generated by the segmentation head to obtain the final segmentation mask.
- each pixel in the image must be classified as foreground or background, so this task can be regarded as a pixel-level binary classification task.
- M′ the values of each point i are m′ i and
- the segmentation loss is as follows:
- σ represents the sigmoid function and j represents the j-th image in the training batch.
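Because the task is described as pixel-level binary classification with a sigmoid applied to the predicted mask, a common choice for such a loss is binary cross-entropy averaged over pixels and over the images in the batch. The sketch below shows that standard form only; the exact loss used in the application is not recoverable from this text, so the formula, the function names, and the flat-list data layout are assumptions.

```python
import math

def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def segmentation_bce(logits, targets, eps=1e-12):
    """Mean binary cross-entropy over every pixel of every image in a batch.

    logits:  list (batch) of flat lists of predicted mask logits.
    targets: same shape, with 1 for foreground and 0 for background."""
    total, count = 0.0, 0
    for img_logits, img_targets in zip(logits, targets):
        for m, t in zip(img_logits, img_targets):
            p = sigmoid(m)
            total += -(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps))
            count += 1
    return total / count
```

A logit of 0 on every pixel gives a loss of ln 2 (about 0.693), and the loss approaches 0 as the logits become confidently correct.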
- a pixel-level contrast loss function can be used as an auxiliary function of the segmentation loss. This function reduces the distance between pixel features within the target object and increases the distance between pixel features within the target object and pixel features outside the object, as shown in Figure 10.
- an image and a natural language sentence describing an instance target in the image are input.
- the model will directly predict the mask M or detection box of the instance target, upsample and interpolate it back to the original image size, and binarize it to segment the instance target.
- RefCOCO, RefCOCO+, and G-Ref (RefCOCOg)
- the images of these three datasets are all from the MSCOCO dataset, and each is accompanied by different language annotations.
- the language annotations of RefCOCO and RefCOCO+ are generated through a game called ReferItGame.
- RefCOCO consists of 142,209 natural language annotations and 19,994 images
- RefCOCO+ consists of 141,564 natural language annotations and 19,992 images.
- the main difference between RefCOCO and RefCOCO+ is that RefCOCO+ does not allow the use of location words such as "left" and "front" in the language annotations.
- the RefCOCO+ dataset is more challenging.
- the language annotations on the G-Ref dataset are from Amazon Mechanical Turk; the dataset contains 85,474 language annotations and 26,711 images.
- this dataset has two partitioning schemes, namely the UMD split and the Google split.
- the language annotations of G-Ref are more complex and varied, and its average sentence length is also greater than that of the RefCOCO and RefCOCO+ datasets, which makes G-Ref a more challenging dataset.
- the original input data is an RGB image, a 0-1 mask matrix, and a language annotation string.
- the preprocessing of the image part of the data is as follows: for the training data, the RGB image is normalized and then scaled to a uniform resolution of 448×448 through bilinear interpolation. At the same time, the 0-1 mask matrix is scaled to the same resolution as the RGB image through nearest-neighbor interpolation. For the test data, only the RGB image needs the above processing; nearest-neighbor interpolation of the 0-1 mask matrix is not required.
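Nearest-neighbor interpolation suits the 0-1 mask because it only copies existing values, so the resized mask stays binary, whereas bilinear interpolation would produce fractional values along object boundaries. The sketch below is a plain-Python illustration; the function name and the floor-based index mapping are assumptions, and image libraries may use a slightly different sampling convention.

```python
def resize_nearest(mask, out_h, out_w):
    """Resize a 2-D 0/1 mask (list of rows) with nearest-neighbor sampling."""
    in_h, in_w = len(mask), len(mask[0])
    return [
        [
            # Map each output pixel back to a source pixel by integer scaling.
            mask[min(in_h - 1, (r * in_h) // out_h)][min(in_w - 1, (c * in_w) // out_w)]
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]
```

Calling `resize_nearest(mask, 448, 448)` would bring a mask to the training resolution mentioned above.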
- the preprocessing of the language part of the data is as follows: Use BertTokenizer in the HuggingFace library to tokenize the input string.
- BertTokenizer is based on the WordPiece embedding method, and its vocabulary size is 30,000. For each tokenized sequence, the first token is a special [CLS] token. If multiple sentences are input, another special [SEP] token will be inserted between sentences.
- intersection over union (IoU) measures the similarity between the predicted region and the ground-truth region.
- IoU is defined as the area of the intersection of the predicted region and the ground-truth region divided by the area of their union.
- the global (overall) IoU, denoted oIoU, is the sum of the intersections over all test images divided by the sum of their unions.
- the average IoU is the average of the IoUs of all test images.
- prec@X is the percentage of images with an IoU greater than a certain threshold X among all test images. In the experiment, the value of X is usually 0.5, 0.6, 0.7, 0.8, or 0.9.
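The three metrics above differ only in how per-image intersections and unions are aggregated. The following sketch computes them for binary masks given as flat 0/1 lists; the function names are hypothetical, prec@X is returned as a fraction rather than a percentage, and treating an empty union as an IoU of 0 is an assumption.

```python
def iou_counts(pred, gt):
    # Pixel counts of the intersection and union of two flat 0/1 masks.
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    return inter, union

def evaluate(preds, gts, thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
    counts = [iou_counts(p, g) for p, g in zip(preds, gts)]
    ious = [i / u if u else 0.0 for i, u in counts]
    total_union = sum(u for _, u in counts)
    # Global IoU: sum of intersections divided by sum of unions.
    overall = sum(i for i, _ in counts) / total_union if total_union else 0.0
    # Average IoU: mean of the per-image IoUs.
    mean = sum(ious) / len(ious)
    # prec@X: fraction of images whose IoU exceeds X.
    prec = {x: sum(1 for v in ious if v > x) / len(ious) for x in thresholds}
    return overall, mean, prec
```

The global IoU weights large objects more heavily than the average IoU, because an image with a large union contributes more pixels to both sums.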
- CoupAlign is compared with previous SOTA methods on RefCOCO and G-Ref in terms of oIoU.
- the language annotations provided by the RefCOCO dataset contain many positional words, such as: "The closest girl on the right". This requires the model to not only understand the correspondence between nouns and objects, but also understand the positional relationship between objects represented by positional words.
- CoupAlign improves by 1.97%, 1.94%, and 1.79% on val, testA, and testB of RefCOCO, respectively.
- the language annotations on G-Ref have more complex grammatical structures than RefCOCO, and the average sentence length is also longer.
- Word-pixel alignment enables cross-modal interaction to occur at both the low and high levels of encoding.
- from Table 2, it can be seen that after removing the word-pixel alignment module, the model's oIoU index dropped by about 4.3%, which shows that the word-pixel alignment module is necessary in the encoding stage.
- the model's oIoU index dropped by about 2%. This shows that not only attention from language to vision, but also attention from vision to language is very important.
- the model's oIoU index dropped by about 1.7%.
- the mask prediction is visualized in order from large to small similarity to the sentence semantics.
- the greater the similarity, the greater the overlap between the mask and the target object; the smaller the similarity, the smaller the overlap between the mask and the target object.
- from Figure 11A, it can be seen that the sentence target alignment module allows the model to pay attention to different objects, thereby perceiving the positional relationship between objects. Because the target integrity constraint is introduced, the segmentation prediction of the model is also less likely to produce hollowing, fragmentation and similar artifacts.
- the present application provides a data processing method, the method comprising: obtaining a first image feature corresponding to an image and a text feature corresponding to a text; the semantics of the text corresponds to the target object, and the text indicates the area corresponding to the target object predicted from the image; according to the preset multiple first embedding vectors and the first image features, multiple second embedding vectors are obtained through a neural network, each second embedding vector corresponds to an object in the image; each second embedding vector and the first image feature are used to fuse to obtain a corresponding second image feature; according to the similarity between the text features and the multiple second embedding vectors, the weight corresponding to each second embedding vector is determined, and the multiple weights are used to fuse with the multiple second image features to determine the predicted area corresponding to the target object.
- the present application also provides a data processing method, the method comprising:
- each second embedding vector corresponds to an object in the image
- each second embedding vector and the first image feature are used to fuse to obtain a corresponding second image feature
- the feature extraction network and the neural network are updated according to the difference between the predicted area and the real area corresponding to the target object in the image.
- the predicted area is a mask area or a detection box.
- the semantics of the text corresponds to the target object, specifically including: the semantics of the text is used to describe the characteristics of the target object.
- acquiring a first image feature corresponding to the image and a text feature corresponding to the text includes:
- the third image feature and the first text feature are fused through a bidirectional attention mechanism to obtain the first image feature corresponding to the image and the text feature corresponding to the text.
- FIG. 12 is a schematic diagram of the structure of a data processing device provided in an embodiment of the present application.
- a data processing device 1200 provided in an embodiment of the present application includes:
- the processing module 1201 is used to obtain a first image feature corresponding to an image and a text feature corresponding to a text; the semantics of the text corresponds to a target object, and the text indicates a region corresponding to the target object predicted from the image;
- each second embedding vector corresponds to an object in the image
- each second embedding vector and the first image feature are used to fuse to obtain a corresponding second image feature
- a weight corresponding to each of the second embedding vectors is determined, and the multiple weights are used to be fused with the multiple second image features to determine the predicted area corresponding to the target object.
- processing module 1201 can refer to the description of steps 601 to 603 in the above embodiment, which will not be repeated here.
- the image may include multiple objects including the target object, each second embedding vector corresponds to an object in the image, and one or more embedding vectors in the multiple second embedding vectors may correspond to the target object.
- the "correspondence" here can be understood as the second embedding vector is used to describe the characteristics of an object in the image, and the second embedding vector obtained by the neural network can distinguish different objects in the image, so that the subsequent prediction process can be based on the object granularity.
- This is equivalent to changing the image features from pixel granularity to target object granularity, that is, introducing target integrity constraints in cross-modal feature fusion, fusing pixels belonging to the same target as a whole with language encoding, and activating instance regions based on targets.
- This can effectively solve the problems of inaccurate target positioning and mask prediction or detection box prediction in existing language-driven precise instance segmentation methods, thereby improving the processing accuracy of the model.
- the predicted area is a mask area or a detection box.
- the semantics of the text corresponds to the target object, specifically including: the semantics of the text is used to describe the characteristics of the target object.
- the processing module is specifically configured to:
- the third image feature and the first text feature are fused through a bidirectional attention mechanism to obtain the first image feature corresponding to the image and the text feature corresponding to the text.
- the first image feature is a feature that is upsampled to a size consistent with the image.
- the neural network includes multiple transformer layers.
- the present application also provides a data processing device, including:
- a processing module used for obtaining a first image feature corresponding to an image and a text feature corresponding to a text; the semantics of the text corresponds to a target object, and the text indicates a region corresponding to the target object predicted from the image; the first image feature and the text feature are obtained according to a feature extraction network;
- each second embedding vector corresponds to an object in the image
- each second embedding vector and the first image feature are used to fuse to obtain a corresponding second image feature
- An updating module is used to update the feature extraction network and the neural network according to the difference between the predicted area and the real area corresponding to the target object in the image.
- the image may include multiple objects including the target object, each second embedding vector corresponds to an object in the image, and one or more embedding vectors in the multiple second embedding vectors may correspond to the target object.
- the "correspondence" here can be understood as the second embedding vector is used to describe the characteristics of an object in the image, and the second embedding vector obtained by the neural network can distinguish different objects in the image, so that the subsequent prediction process can be based on the object granularity.
- This is equivalent to changing the image features from pixel granularity to target object granularity, that is, introducing target integrity constraints in cross-modal feature fusion, fusing pixels belonging to the same target as a whole with language encoding, and activating instance areas based on targets.
- This can effectively solve the problems of inaccurate target positioning and mask prediction or detection box prediction in existing language-driven precise instance segmentation methods, thereby improving the processing accuracy of the model.
- the predicted area is a mask area or a detection box.
- the semantics of the text corresponds to the target object, specifically including: the semantics of the text is used to describe the characteristics of the target object.
- the processing module is specifically configured to:
- the third image feature and the first text feature are fused through a bidirectional attention mechanism to obtain the first image feature corresponding to the image and the text feature corresponding to the text.
- FIG. 13 is a structural schematic diagram of an execution device provided in an embodiment of the present application.
- the execution device 1300 can be specifically manifested as a virtual reality VR device, a mobile phone, a tablet, a laptop, a smart wearable device, a monitoring data processing device or a server, etc., which is not limited here.
- the execution device 1300 includes: a receiver 1301, a transmitter 1302, a processor 1303 and a memory 1304 (wherein the number of processors 1303 in the execution device 1300 can be one or more, and one processor is taken as an example in Figure 13), wherein the processor 1303 may include an application processor 13031 and a communication processor 13032.
- the receiver 1301, the transmitter 1302, the processor 1303 and the memory 1304 may be connected via a bus or other means.
- the memory 1304 may include a read-only memory and a random access memory, and provides instructions and data to the processor 1303. A portion of the memory 1304 may also include a non-volatile random access memory (NVRAM).
- the memory 1304 stores processors and operation instructions, executable modules or data structures, or a subset thereof, or an extended set thereof, wherein the operation instructions may include various operation instructions for implementing various operations.
- the processor 1303 controls the operation of the execution device.
- the various components of the execution device are coupled together through a bus system, wherein the bus system includes not only a data bus but also a power bus, a control bus, and a status signal bus, etc.
- various buses are referred to as bus systems in the figure.
- the method disclosed in the above embodiment of the present application can be applied to the processor 1303, or implemented by the processor 1303.
- the processor 1303 can be an integrated circuit chip with signal processing capabilities. In the implementation process, each step of the above method can be completed by the hardware integrated logic circuit in the processor 1303 or the instruction in the form of software.
- the above processor 1303 can be a general-purpose processor, a digital signal processor (DSP), a microprocessor or a microcontroller, and can further include an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components.
- the processor 1303 can implement or execute the various methods, steps and logic block diagrams disclosed in the embodiment of the present application.
- the general-purpose processor can be a microprocessor, or the processor can be any conventional processor, etc.
- the steps of the method disclosed in the embodiment of the present application can be directly executed by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor.
- the software module may be located in a storage medium that is well established in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register.
- the storage medium is located in the memory 1304, and the processor 1303 reads the information in the memory 1304 and completes the steps of the above method involving the model reasoning process in combination with its hardware.
- the receiver 1301 can be used to receive input digital or character information and generate signal input related to the relevant settings and function control of the execution device.
- the transmitter 1302 can be used to output digital or character information through the first interface; the transmitter 1302 can also be used to send instructions to the disk group through the first interface to modify the data in the disk group; the transmitter 1302 can also include a display device such as a display screen.
- FIG. 14 is a structural diagram of a training device provided by the embodiment of the present application.
- the training device 1400 is implemented by one or more servers.
- the training device 1400 may have relatively large differences due to different configurations or performances, and may include one or more central processing units (CPU) 1414 (for example, one or more processors) and a memory 1432, and one or more storage media 1430 (for example, one or more mass storage devices) storing application programs 1442 or data 1444.
- the memory 1432 and the storage medium 1430 can be short-term storage or permanent storage.
- the program stored in the storage medium 1430 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations in the training device. Furthermore, the central processor 1414 can be configured to communicate with the storage medium 1430 to execute a series of instruction operations in the storage medium 1430 on the training device 1400.
- the training device 1400 may also include one or more power supplies 1426, one or more wired or wireless network interfaces 1450, one or more input and output interfaces 1458, and/or one or more operating systems 1441, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
- the central processing unit 1414 is used to execute actions related to model training in the above embodiments.
- Also provided in an embodiment of the present application is a computer program product which, when executed on a computer, enables the computer to execute the steps executed by the aforementioned execution device, or enables the computer to execute the steps executed by the aforementioned training device.
- a computer-readable storage medium is also provided in an embodiment of the present application, which stores a program for signal processing.
- when the program stored in the computer-readable storage medium is run on a computer, it enables the computer to execute the steps executed by the aforementioned execution device, or enables the computer to execute the steps executed by the aforementioned training device.
- the execution device, training device or terminal device provided in the embodiments of the present application may specifically be a chip, and the chip includes: a processing unit and a communication unit, wherein the processing unit may be, for example, a processor, and the communication unit may be, for example, an input/output interface, a pin or a circuit, etc.
- the processing unit may execute the computer execution instructions stored in the storage unit so that the chip in the execution device executes the data processing method described in the above embodiment, or so that the chip in the training device executes the data processing method described in the above embodiment.
- the storage unit is a storage unit in the chip, such as a register, a cache, etc.
- the storage unit may also be a storage unit located outside the chip in the wireless access device, such as a read-only memory (ROM) or other types of static storage devices that can store static information and instructions, a random access memory (RAM), etc.
- FIG. 15 is a schematic diagram of a structure of a chip provided in an embodiment of the present application.
- the chip can be expressed as a neural network processor NPU 1500.
- NPU 1500 is mounted on the host CPU (Host CPU) as a coprocessor, and tasks are assigned by the Host CPU.
- the core part of the NPU is the operation circuit 1503, which is controlled by the controller 1504 to extract matrix data from the memory and perform multiplication operations.
- the operation circuit 1503 includes multiple processing units (Process Engine, PE) inside.
- the operation circuit 1503 is a two-dimensional systolic array.
- the operation circuit 1503 can also be a one-dimensional systolic array or other electronic circuits capable of performing mathematical operations such as multiplication and addition.
- the operation circuit 1503 is a general-purpose matrix processor.
- the operation circuit takes the corresponding data of matrix B from the weight memory 1502 and caches it on each PE in the operation circuit.
- the operation circuit takes the matrix A data from the input memory 1501 and performs matrix operation with matrix B, and the partial result or final result of the matrix is stored in the accumulator 1508.
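The data flow described above (weights of matrix B held in the processing units, matrix A streamed in, partial results collected in an accumulator) can be illustrated with a small software model. This is only a conceptual sketch: the tiling along the shared dimension, the `tile` parameter, and the function name are assumptions and do not describe the actual circuit.

```python
def matmul_accumulate(A, B, tile=2):
    """Compute A x B by adding partial products into an accumulator, tile by tile."""
    m, k, n = len(A), len(B), len(B[0])
    # The accumulator starts at zero and holds partial results until all tiles are done.
    acc = [[0] * n for _ in range(m)]
    for k0 in range(0, k, tile):
        k1 = min(k0 + tile, k)
        for i in range(m):
            for j in range(n):
                # Partial dot product over one slice of the shared dimension.
                acc[i][j] += sum(A[i][kk] * B[kk][j] for kk in range(k0, k1))
    return acc
```

After the last tile, the accumulator holds the final result; before that, it holds the partial results mentioned above.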
- Unified memory 1506 is used to store input data and output data. Weight data is directly transferred to weight memory 1502 through Direct Memory Access Controller (DMAC) 1505. Input data is also transferred to unified memory 1506 through DMAC.
- BIU stands for Bus Interface Unit, that is, the bus interface unit 1510, which is used for the interaction between AXI bus and DMAC and instruction fetch buffer (IFB) 1509.
- the bus interface unit (BIU) 1510 is used by the instruction fetch buffer 1509 to obtain instructions from the external memory, and is also used by the direct memory access controller (DMAC) 1505 to obtain the original data of the input matrix A or the weight matrix B from the external memory.
- DMAC is mainly used to transfer input data in the external memory DDR to the unified memory 1506 or to transfer weight data to the weight memory 1502 or to transfer input data to the input memory 1501.
- the vector calculation unit 1507 includes multiple operation processing units, which further process the output of the operation circuit 1503 when necessary, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison, etc. It is mainly used for non-convolutional/fully connected layer network calculations in neural networks, such as Batch Normalization, pixel-level summation, upsampling of feature planes, etc.
- the vector calculation unit 1507 can store the processed output vector to the unified memory 1506.
- the vector calculation unit 1507 can apply a linear function or a nonlinear function to the output of the operation circuit 1503, for example, linear interpolation of the feature plane extracted by the convolution layer, or a nonlinear function applied to a vector of accumulated values to generate an activation value.
- the vector calculation unit 1507 generates a normalized value, a pixel-level summed value, or both.
- the processed output vector can be used as an activation input to the operation circuit 1503, for example, for use in a subsequent layer in a neural network.
- An instruction fetch buffer 1509 connected to the controller 1504 is used to store instructions used by the controller 1504;
- Unified memory 1506, input memory 1501, weight memory 1502 and instruction fetch buffer 1509 are all on-chip memories. The external memory is private to the NPU hardware architecture.
- the processor mentioned in any of the above places may be a general-purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits for controlling the execution of the above programs.
- the device embodiments described above are merely schematic, wherein the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the scheme of this embodiment.
- the connection relationship between the modules indicates that there is a communication connection between them, which may be specifically implemented as one or more communication buses or signal lines.
- the technical solution of the present application, in essence, or the part that contributes to the prior art, can be embodied in the form of a software product, which is stored in a readable storage medium, such as a computer floppy disk, USB flash drive, removable hard disk, ROM, RAM, magnetic disk or optical disc, and includes a number of instructions to enable a computer device (which can be a personal computer, a training device, or a network device, etc.) to execute the methods described in each embodiment of the present application.
- all or part of the embodiments may be implemented by software, hardware, firmware or any combination thereof.
- all or part of the embodiments may be implemented in the form of a computer program product.
- the computer program product includes one or more computer instructions.
- the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable devices.
- the computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium.
- the computer instructions may be transmitted from one website, computer, training device, or data center to another website, computer, training device, or data center in a wired manner (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or a wireless manner (e.g., infrared, radio, microwave, etc.).
- the computer-readable storage medium may be any available medium that a computer can access, or a data storage device, such as a training device or a data center, that integrates one or more available media.
- the available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, or a tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid-state drive (SSD)), etc.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Computer Vision & Pattern Recognition (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Evolutionary Computation (AREA)
- Computational Linguistics (AREA)
- General Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Data Mining & Analysis (AREA)
- Computing Systems (AREA)
- Software Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Databases & Information Systems (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Molecular Biology (AREA)
- Mathematical Physics (AREA)
- Medical Informatics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Image Analysis (AREA)
Abstract
Description
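$V'_i \leftarrow \mathrm{Gate}(V'_i), \quad L'_i \leftarrow \mathrm{Gate}(L'_i)$
$F'_i = \mathrm{Gate}(F_i) = \mathrm{MLP}(F_i) \odot F_i$
$S_o = \mathrm{CrossAttn}(F_o) + V_o, \quad F_o = [V'_o; L'_o]$
$Q_o = \mathrm{MaskGenerator}(Q, S_o)$
$Q_w = \mathrm{softmax}(\mathrm{sim}(L_g, Q_o))$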
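The gate formula F′ = MLP(F) ⊙ F listed above multiplies each feature element-wise by a learned factor computed from the same feature. The sketch below illustrates this for a single feature vector; the structure of the MLP (two linear layers with ReLU and tanh), the weight names, and the function names are all assumptions, since the text does not specify them.

```python
import math

def linear(x, W, b):
    # Dense layer: one output per row of W.
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def gate(F, W1, b1, W2, b2):
    """Return MLP(F) * F element-wise, with a hypothetical two-layer MLP."""
    hidden = [max(0.0, h) for h in linear(F, W1, b1)]   # ReLU
    g = [math.tanh(v) for v in linear(hidden, W2, b2)]  # gate values in (-1, 1)
    return [gi * fi for gi, fi in zip(g, F)]
```

With the tanh output assumed here, a gate value near 0 suppresses a feature and a value near 1 passes it through almost unchanged.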
Claims (23)
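- A data processing method, characterized by comprising: obtaining a first image feature corresponding to an image and a text feature corresponding to a text, wherein the semantics of the text corresponds to a target object, and the text indicates that a region corresponding to the target object is to be predicted from the image; obtaining, through a neural network, a plurality of second embedding vectors according to a plurality of preset first embedding vectors and the first image feature, wherein each second embedding vector corresponds to an object in the image, and each second embedding vector and the first image feature are used to be fused to obtain a corresponding second image feature; and determining, according to a similarity between the text feature and the plurality of second embedding vectors, a weight corresponding to each second embedding vector, wherein the plurality of weights are used to be fused with the plurality of second image features to determine a predicted region corresponding to the target object.
- The method according to claim 1, characterized in that the predicted region is a mask region or a detection box.
- The method according to claim 1 or 2, characterized in that the semantics of the text corresponding to the target object specifically comprises: the semantics of the text is used to describe a feature of the target object.
- The method according to any one of claims 1 to 3, characterized in that the obtaining a first image feature corresponding to an image and a text feature corresponding to a text comprises: processing the image through an image encoder to obtain an image feature corresponding to the image; processing the text through a text encoder to obtain a first text feature corresponding to the text; and fusing the third image feature and the first text feature through a bidirectional attention mechanism to obtain the first image feature corresponding to the image and the text feature corresponding to the text.
- The method according to any one of claims 1 to 4, characterized in that the first image feature is a feature upsampled to a size consistent with the size of the image.
- The method according to any one of claims 1 to 5, characterized in that the neural network comprises a plurality of transformer layers.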
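- A data processing method, characterized by comprising: obtaining a first image feature corresponding to an image and a text feature corresponding to a text, wherein the semantics of the text corresponds to a target object, the text indicates that a region corresponding to the target object is to be predicted from the image, and the first image feature and the text feature are obtained according to a feature extraction network; obtaining, through a neural network, a plurality of second embedding vectors according to a plurality of preset first embedding vectors and the first image feature, wherein each second embedding vector corresponds to an object in the image, and each second embedding vector and the first image feature are used to be fused to obtain a corresponding second image feature; determining, according to a similarity between the text feature and the plurality of second embedding vectors, a weight corresponding to each second embedding vector, wherein the plurality of weights are used to be fused with the plurality of second image features to determine a predicted region corresponding to the target object; and updating the feature extraction network and the neural network according to a difference between the predicted region and a real region corresponding to the target object in the image.
- The method according to claim 7, characterized in that the predicted region is a mask region or a detection box.
- The method according to claim 7 or 8, characterized in that the semantics of the text corresponding to the target object specifically comprises: the semantics of the text is used to describe a feature of the target object.
- The method according to any one of claims 7 to 9, characterized in that the obtaining a first image feature corresponding to an image and a text feature corresponding to a text comprises: processing the image through an image encoder to obtain an image feature corresponding to the image; processing the text through a text encoder to obtain a first text feature corresponding to the text; and fusing the third image feature and the first text feature through a bidirectional attention mechanism to obtain the first image feature corresponding to the image and the text feature corresponding to the text.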
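- A data processing apparatus, characterized by comprising: a processing module, configured to: obtain a first image feature corresponding to an image and a text feature corresponding to a text, wherein the semantics of the text corresponds to a target object, and the text indicates that a region corresponding to the target object is to be predicted from the image; obtain, through a neural network, a plurality of second embedding vectors according to a plurality of preset first embedding vectors and the first image feature, wherein each second embedding vector corresponds to an object in the image, and each second embedding vector and the first image feature are used to be fused to obtain a corresponding second image feature; and determine, according to a similarity between the text feature and the plurality of second embedding vectors, a weight corresponding to each second embedding vector, wherein the plurality of weights are used to be fused with the plurality of second image features to determine a predicted region corresponding to the target object.
- The apparatus according to claim 11, characterized in that the predicted region is a mask region or a detection box.
- The apparatus according to claim 11 or 12, characterized in that the semantics of the text corresponding to the target object specifically comprises: the semantics of the text is used to describe a feature of the target object.
- The apparatus according to any one of claims 11 to 13, characterized in that the processing module is specifically configured to: process the image through an image encoder to obtain an image feature corresponding to the image; process the text through a text encoder to obtain a first text feature corresponding to the text; and fuse the third image feature and the first text feature through a bidirectional attention mechanism to obtain the first image feature corresponding to the image and the text feature corresponding to the text.
- The apparatus according to any one of claims 11 to 14, characterized in that the first image feature is a feature upsampled to a size consistent with the size of the image.
- The apparatus according to any one of claims 11 to 15, characterized in that the neural network comprises a plurality of transformer layers.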
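- A data processing apparatus, comprising: a processing module, configured to: obtain a first image feature corresponding to an image and a text feature corresponding to a text, wherein the semantics of the text correspond to a target object, the text indicates that a region corresponding to the target object is to be predicted from the image, and the first image feature and the text feature are obtained through a feature extraction network; obtain, through a neural network, a plurality of second embedding vectors based on a plurality of preset first embedding vectors and the first image feature, wherein each second embedding vector corresponds to one object in the image, and each second embedding vector and the first image feature are fused to obtain one corresponding second image feature; and determine, based on similarities between the text feature and the plurality of second embedding vectors, a weight corresponding to each second embedding vector, wherein the plurality of weights are fused with the plurality of second image features to determine a predicted region corresponding to the target object; and an update module, configured to update the feature extraction network and the neural network based on a difference between the predicted region and a ground-truth region corresponding to the target object in the image.
- The apparatus according to claim 17, wherein the predicted region is a mask region or a detection box.
- The apparatus according to claim 17 or 18, wherein the semantics of the text corresponding to the target object specifically means that the semantics of the text describe features of the target object.
- The apparatus according to any one of claims 17 to 19, wherein the processing module is specifically configured to: process the image with an image encoder to obtain a third image feature corresponding to the image; process the text with a text encoder to obtain a first text feature corresponding to the text; and fuse the third image feature and the first text feature through a bidirectional attention mechanism to obtain the first image feature corresponding to the image and the text feature corresponding to the text.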
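- A computer storage medium, storing one or more instructions that, when executed by one or more computers, cause the one or more computers to perform the operations of the method according to any one of claims 1 to 10.
- A computer program product, comprising computer-readable instructions that, when run on a computer device, cause the computer device to perform the method according to any one of claims 1 to 10.
- A system, comprising at least one processor and at least one memory, wherein the processor and the memory are connected through a communication bus and communicate with each other; the at least one memory is configured to store code; and the at least one processor is configured to execute the code to perform the method according to any one of claims 1 to 10.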
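The flow recited in the independent method claims can be sketched in NumPy as follows. This is a rough illustration under stated assumptions, not the claimed implementation: the single cross-attention step standing in for the neural network, the dot-product fusion that produces each per-object map, the cosine similarity, the sigmoid output, and the binary cross-entropy used as the "difference" for training are all assumptions, as are every function name and tensor size.

```python
import numpy as np


def softmax(x, axis=-1):
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def refine_queries(queries, image_feat):
    """Assumed stand-in for the neural network: one cross-attention step
    in which each preset embedding attends to all pixel features."""
    attn = softmax(queries @ image_feat.T / np.sqrt(queries.shape[1]))
    return queries + attn @ image_feat  # second embeddings, shape (N, C)


def predict_region(queries, image_feat, text_feat, height, width):
    object_queries = refine_queries(queries, image_feat)
    # Second image features: one map per object (assumed dot-product fusion).
    per_object_maps = object_queries @ image_feat.T  # shape (N, H*W)
    # Weights: softmax over the cosine similarity to the text feature.
    q_unit = object_queries / np.linalg.norm(object_queries, axis=1, keepdims=True)
    t_unit = text_feat / np.linalg.norm(text_feat)
    weights = softmax(q_unit @ t_unit)  # shape (N,)
    # Fuse the per-object maps with the weights into one predicted mask.
    fused = weights @ per_object_maps  # shape (H*W,)
    prob = 1.0 / (1.0 + np.exp(-fused))
    return prob.reshape(height, width), weights


def bce_loss(prob, target, eps=1e-7):
    """Assumed training signal: binary cross-entropy against the true mask."""
    prob = np.clip(prob, eps, 1.0 - eps)
    return float(-np.mean(target * np.log(prob) + (1 - target) * np.log(1 - prob)))


# Hypothetical sizes: 4x4 feature map, 8 channels, 3 preset embeddings.
rng = np.random.default_rng(0)
height, width, channels, num_queries = 4, 4, 8, 3
image_feat = rng.normal(size=(height * width, channels))
queries = rng.normal(size=(num_queries, channels))
text_feat = rng.normal(size=channels)

prob_map, weights = predict_region(queries, image_feat, text_feat, height, width)
target = np.zeros((height, width))
target[1:3, 1:3] = 1.0
loss = bce_loss(prob_map, target)
```

Thresholding `prob_map` (for example at 0.5) would give a mask region. In training, the gradient of `loss` would be used to update both the feature extraction network and the network that refines the embeddings; the sketch computes only the loss value.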
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| EP23879109.9A EP4592866B1 (en) | 2022-10-20 | 2023-10-17 | Data processing method and apparatus |
| US19/182,947 US20250272945A1 (en) | 2022-10-20 | 2025-04-18 | Data processing method and apparatus |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202211292146.9A CN115757692A (zh) | 2022-10-20 | 2022-10-20 | Data processing method and apparatus |
| CN202211292146.9 | 2022-10-20 |
Related Child Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US19/182,947 Continuation US20250272945A1 (en) | 2022-10-20 | 2025-04-18 | Data processing method and apparatus |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2024083121A1 (zh) | 2024-04-25 |
Family
ID=85352471
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/CN2023/124977 Ceased WO2024083121A1 (zh) | 2022-10-20 | 2023-10-17 | Data processing method and apparatus |
Country Status (4)
| Country | Link |
|---|---|
| US (1) | US20250272945A1 (zh) |
| EP (1) | EP4592866B1 (zh) |
| CN (1) | CN115757692A (zh) |
| WO (1) | WO2024083121A1 (zh) |
Families Citing this family (13)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN115757692A (zh) * | 2022-10-20 | 2023-03-07 | 华为技术有限公司 | Data processing method and apparatus |
| CN116503594A (zh) * | 2023-03-24 | 2023-07-28 | 华为技术有限公司 | Data processing method and apparatus |
| CN116433621B (zh) * | 2023-03-25 | 2026-02-13 | 华为技术有限公司 | Data processing method and apparatus |
| US20240357104A1 (en) * | 2023-04-21 | 2024-10-24 | Nokia Technologies Oy | Determining regions of interest using learned image codec for machines |
| CN116883715A (zh) * | 2023-05-30 | 2023-10-13 | 华为技术有限公司 | Data processing method and apparatus |
| CN116704506A (zh) * | 2023-06-21 | 2023-09-05 | 大连理工大学 | Referring image segmentation method based on cross-environment attention |
| CN116993976B (zh) * | 2023-07-17 | 2024-06-14 | 中国科学院自动化研究所 | Referring image segmentation model training method and referring image segmentation method |
| AU2023446714A1 (en) * | 2023-08-04 | 2025-02-20 | Shenzhen Mango Science And Technology Innovation Co., Ltd. | Method for adjusting lighting device, apparatus, lighting device and storage medium |
| CN117251592A (zh) * | 2023-08-23 | 2023-12-19 | 华为技术有限公司 | Data processing method and apparatus |
| US20260030861A1 (en) * | 2024-07-29 | 2026-01-29 | Nvidia Corporation | Segmentation of media content using vision language models |
| CN119314337A (zh) * | 2024-12-17 | 2025-01-14 | 松立控股集团股份有限公司 | Dynamic and static traffic collaborative governance method and system |
| CN120747198B (zh) * | 2025-06-23 | 2026-01-02 | 电子科技大学 | Computer-vision-based rice grain measurement method |
| CN120599274B (zh) * | 2025-08-08 | 2025-11-07 | 广东工业大学 | Deep-learning-based lesion segmentation method and apparatus for rectal cancer medical images |
Citations (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN108776787A (zh) * | 2018-06-04 | 2018-11-09 | 北京京东金融科技控股有限公司 | Image processing method and apparatus, electronic device, and storage medium |
| CN113065576A (zh) * | 2021-02-26 | 2021-07-02 | 华为技术有限公司 | Feature extraction method and apparatus |
| CN113505193A (zh) * | 2021-06-01 | 2021-10-15 | 华为技术有限公司 | Data processing method and related device |
| US20220147715A1 (en) * | 2019-05-16 | 2022-05-12 | Huawei Technologies Co., Ltd. | Text processing method, model training method, and apparatus |
| CN115115913A (zh) * | 2022-06-02 | 2022-09-27 | 北京科技大学 | Data processing method, apparatus, electronic device, and storage medium |
| CN115757692A (zh) * | 2022-10-20 | 2023-03-07 | 华为技术有限公司 | Data processing method and apparatus |
Family Cites Families (8)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
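| CN110033022A (zh) * | 2019-03-08 | 2019-07-19 | 腾讯科技(深圳)有限公司 | Text processing method, apparatus, and storage medium |
| CN111210443B (zh) * | 2020-01-03 | 2022-09-13 | 吉林大学 | Deformable-convolution hybrid task cascade semantic segmentation method based on embedding balance |
| CN112215235B (zh) * | 2020-10-16 | 2024-04-26 | 深圳华付技术股份有限公司 | Scene text detection method for text with large character spacing and partial occlusion |
| CN114283430B (zh) * | 2021-12-03 | 2024-12-06 | 苏州大创科技有限公司 | Cross-modal image-text matching training method and apparatus, storage medium, and electronic device |
| CN114298057B (zh) * | 2022-01-04 | 2024-08-09 | 中国人民解放军国防科技大学 | Visual-semantic embedding method and system based on data augmentation |
| CN114648631B (zh) * | 2022-03-22 | 2024-11-01 | 平安科技(深圳)有限公司 | Image caption generation method and apparatus, electronic device, and storage medium |
| CN115131556A (zh) * | 2022-05-27 | 2022-09-30 | 吉林大学 | Deep-learning-based image instance segmentation method |
| CN115063585B (zh) * | 2022-05-30 | 2025-08-29 | 华为技术有限公司 | Training method for unsupervised semantic segmentation model and related apparatus |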
- 2022-10-20: CN application CN202211292146.9A, published as CN115757692A (zh); status: active, Pending
- 2023-10-17: EP application EP23879109.9A, published as EP4592866B1 (en); status: active, Active
- 2023-10-17: WO application PCT/CN2023/124977, published as WO2024083121A1 (zh); status: not active, Ceased
- 2025-04-18: US application US19/182,947, published as US20250272945A1 (en); status: active, Pending
Non-Patent Citations (1)
| Title |
|---|
| See also references of EP4592866A4 |
Cited By (9)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
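| CN118115622A (zh) * | 2024-04-28 | 2024-05-31 | 腾讯科技(深圳)有限公司 | Image generation model processing method, apparatus, device, storage medium, and product |
| CN118505858A (zh) * | 2024-07-19 | 2024-08-16 | 山东海量信息技术研究院 | Image generation method, device, medium, and computer program product |
| CN119228822A (zh) * | 2024-09-05 | 2024-12-31 | 北京纳通医用机器人科技有限公司 | Image segmentation method, apparatus, device, and medium |
| CN119322982A (zh) * | 2024-09-13 | 2025-01-17 | 同济大学 | Multimodal emotion detection method based on modality synchronization |
| CN119322982B (zh) * | 2024-09-13 | 2025-11-18 | 同济大学 | Multimodal emotion detection method based on modality synchronization |
| CN119313891A (zh) * | 2024-12-18 | 2025-01-14 | 浙江大华技术股份有限公司 | Target recognition method, electronic device, and computer-readable storage medium |
| CN119478423A (zh) * | 2025-01-13 | 2025-02-18 | 北京科技大学 | Open-domain cross-modal remote sensing image target segmentation method and apparatus |
| CN120198392A (zh) * | 2025-03-11 | 2025-06-24 | 虹纬盐城纺织有限公司 | Image-recognition-based defect detection method and system for ultra-soft knitted yarn fabric |
| CN120495074A (zh) * | 2025-07-21 | 2025-08-15 | 杭州筋斗云文化服务有限公司 | Deep-neural-network-based color imitation method and system for portrait photography images |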
Also Published As
| Publication number | Publication date |
|---|---|
| CN115757692A (zh) | 2023-03-07 |
| EP4592866A1 (en) | 2025-07-30 |
| EP4592866A4 (en) | 2025-08-27 |
| EP4592866B1 (en) | 2026-03-11 |
| US20250272945A1 (en) | 2025-08-28 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 23879109; Country of ref document: EP; Kind code of ref document: A1 |
| | WWE | WIPO information: entry into national phase | Ref document number: 2023879109; Country of ref document: EP |
| | ENP | Entry into the national phase | Ref document number: 2023879109; Country of ref document: EP; Effective date: 20250425 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | WWP | WIPO information: published in national office | Ref document number: 2023879109; Country of ref document: EP |
| | WWG | WIPO information: grant in national office | Ref document number: 2023879109; Country of ref document: EP |