WO2023045605A1 - Data processing method, apparatus, computer device, and storage medium - Google Patents

Data processing method, apparatus, computer device, and storage medium

Publication number
WO2023045605A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature
text
features
picture
interaction
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2022/111609
Other languages
English (en)
French (fr)
Other versions
WO2023045605A9 (zh)
Inventor
朱灵子
马连洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to EP22871663.5A (EP4310695A4)
Publication of WO2023045605A1
Priority to US18/232,098 (US20230386238A1)
Publication of WO2023045605A9

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/19 Recognition using electronic means
    • G06V30/191 Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/1918 Fusion techniques, i.e. combining data from various sources, e.g. sensor fusion
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/38 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/383 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/126 Character encoding
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/258 Heading extraction; Automatic titling; Numbering
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0499 Feedforward networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/50 Depth or shape recovery
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/18 Extraction of features or characteristics of the image
    • G06V30/1801 Detecting partial patterns, e.g. edges or contours, or configurations, e.g. loops, corners, strokes or intersections
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/18 Extraction of features or characteristics of the image
    • G06V30/18133 Extraction of features or characteristics of the image regional/local feature not essentially salient, e.g. local binary pattern
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/19 Recognition using electronic means
    • G06V30/191 Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/19173 Classification techniques
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/20 Combination of acquisition, preprocessing or recognition functions
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/412 Layout analysis of documents structured with printed lines or input boxes, e.g. business forms or tables
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/413 Classification of content, e.g. text, photographs or tables
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30176 Document
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features

Definitions

  • This application relates to the field of computer technology, in particular to data processing technology.
  • The task of mining high-quality articles has gradually become a research hotspot. Through this task, high-quality articles can be discovered and pushed to users to improve their reading experience.
  • The content quality of an article is usually judged from the text alone (without considering the contribution of the pictures to content quality), or the embedding features of the text and the pictures are concatenated (concat) and the spliced features are used to determine whether the article is a high-quality article.
  • Embodiments of the present application provide a data processing method, device, computer equipment, and storage medium, which can improve the accuracy of article category recognition, and further improve the accuracy of mining high-quality articles.
  • The technical solution is as follows:
  • A data processing method, executed by a computer device, the method comprising:
  • acquiring a text feature and a picture feature of an article, where the text feature is used to characterize the text data in the article and the picture feature is used to characterize the picture data in the article;
  • for the text feature, determining a first interaction feature based on the partial features of the picture feature that are associated with the text feature, where the first interaction feature is used to characterize the text feature fused with the picture feature;
  • for the picture feature, determining a second interaction feature based on the partial features of the text feature that are associated with the picture feature, where the second interaction feature is used to characterize the picture feature fused with the text feature;
  • fusing the first interaction feature with the second interaction feature to obtain a cross-modal interaction feature; and
  • determining the article category to which the article belongs based on the cross-modal interaction feature.
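  • The method steps above can be sketched roughly as follows. This is a minimal illustration, not the claimed implementation: it assumes that "partial features associated with" a modality are selected by scaled dot-product attention, that fusion is mean pooling followed by concatenation, and that the category is predicted by a logistic layer. The names `attend`, `mean_pool`, `classify`, and the toy weights are all hypothetical.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(main, aux):
    """Fuse each main-modality vector with an attention-weighted sum of aux vectors."""
    dim = len(main[0])
    fused = []
    for query in main:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in aux]
        weights = softmax(scores)
        context = [sum(w * key[i] for w, key in zip(weights, aux)) for i in range(dim)]
        fused.append([q + c for q, c in zip(query, context)])
    return fused

def mean_pool(vectors):
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(len(vectors[0]))]

def classify(text_features, picture_features, weights, bias=0.0):
    first = attend(text_features, picture_features)     # text fused with picture
    second = attend(picture_features, text_features)    # picture fused with text
    cross_modal = mean_pool(first) + mean_pool(second)  # assumed fusion: concatenation
    logit = sum(w * x for w, x in zip(weights, cross_modal)) + bias
    return cross_modal, 1.0 / (1.0 + math.exp(-logit))  # probability of "high quality"

text = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy text features
pictures = [[0.5, 0.5], [1.0, 0.0]]           # toy picture features
cross_modal, prob = classify(text, pictures, weights=[0.1, -0.2, 0.3, 0.05])
print(len(cross_modal))  # 4
```

  • The point of the sketch is the symmetry: each modality acts once as the main modality, so the fused vector differs from a plain concatenation of the raw text and picture features.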
  • A data processing device, comprising:
  • a first acquisition module, configured to acquire a text feature and a picture feature of an article, where the text feature is used to characterize the text data in the article and the picture feature is used to characterize the picture data in the article;
  • a second acquisition module, configured to determine, for the text feature, a first interaction feature based on the partial features of the picture feature that are associated with the text feature, where the first interaction feature is used to characterize the text feature fused with the picture feature;
  • a third acquisition module, configured to determine, for the picture feature, a second interaction feature based on the partial features of the text feature that are associated with the picture feature, where the second interaction feature is used to characterize the picture feature fused with the text feature;
  • a fusion module, configured to fuse the first interaction feature with the second interaction feature to obtain a cross-modal interaction feature; and
  • a determining module, configured to determine the article category to which the article belongs based on the cross-modal interaction feature.
  • In one aspect, a computer device is provided. The computer device includes one or more processors and one or more memories; at least one computer program is stored in the one or more memories, and the at least one computer program is loaded and executed by the one or more processors to implement the data processing method in any one of the above possible implementations.
  • A storage medium is provided, in which at least one computer program is stored; the at least one computer program is loaded and executed by a processor to implement the data processing method in any one of the above possible implementations.
  • A computer program product or computer program is provided, comprising one or more pieces of program code stored in a computer-readable storage medium. One or more processors of a computer device can read the program code from the computer-readable storage medium and execute it, so that the computer device can execute the data processing method in any one of the foregoing possible implementations.
  • Text features and picture features are extracted from the text data and picture data of the article, and the cross-modal interaction feature between the two is used to predict the article category to which the article belongs. The method therefore considers the contribution of both the text modality and the picture modality to the article category, rather than judging from the text alone. In addition, the extracted cross-modal interaction feature is not a simple concatenation of the text features and the picture features; it reflects richer and deeper interaction information between the modalities. This helps improve the recognition accuracy of article categories and, in the scenario of identifying high-quality articles, the accuracy of mining high-quality articles.
  • FIG. 1 is a schematic diagram of an implementation environment of a data processing method provided in an embodiment of the present application
  • FIG. 2 is a flow chart of a data processing method provided by an embodiment of the present application.
  • FIG. 3 is a flow chart of a data processing method provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of the principle of extracting position information provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of a cross-modal interaction model provided by an embodiment of the present application.
  • FIG. 6 is a schematic diagram of a multimodal fusion network combined with relative position encoding provided by an embodiment of the present application.
  • FIG. 7 is a flow chart of a data processing method provided by an embodiment of the present application.
  • FIG. 8 is a schematic diagram of a multimodal fusion network provided by an embodiment of the present application.
  • FIG. 9 is a schematic structural diagram of a data processing device provided in an embodiment of the present application.
  • FIG. 10 is a schematic structural diagram of a computer device provided by an embodiment of the present application.
  • FIG. 11 is a schematic structural diagram of a computer device provided by an embodiment of the present application.
  • The terms "first" and "second" are used to distinguish identical or similar items that have basically the same role and function. It should be understood that "first", "second", and "nth" have no logical or temporal dependency on one another and place no restriction on quantity or order of execution.
  • The term "at least one" means one or more, and "multiple" (or "a plurality of") means two or more; for example, a plurality of first positions means two or more first positions.
  • The solution provided by the embodiments of the present application involves artificial intelligence technologies such as machine learning, in particular Multi-Modal Machine Learning (MMML).
  • Modality: Every source or form of information can be called a modality. For example, people have the senses of touch, hearing, vision, and smell; information media include voice, video, and text; and there are various sensors, such as radar, infrared sensors, and accelerometers. Each of these can be called a modality.
  • Modality can also be defined very broadly. For example, two different languages can be regarded as two modalities, and even data sets collected in two different situations can be considered two modalities.
  • An article can optionally be divided into two modalities (text and pictures) or into three modalities (title, body text, and pictures).
  • Multimodal machine learning aims to process and understand information from multiple source modalities through machine learning methods.
  • Unimodal refers to a single modality, and multimodal refers to a combination of two or more modalities in various forms.
  • A popular research direction is multimodal learning among images, video, audio, and semantics.
  • Multimodal learning is divided into the following research directions: multimodal representation learning, modality transformation, alignment, multimodal fusion, collaborative learning, etc.
  • Unimodal representation learning is responsible for representing information as numerical vectors that computers can process, or for further abstracting it into higher-level feature vectors, while multimodal representation learning eliminates the redundancy among modalities so as to learn a better feature representation.
  • Multimodal fusion: A research direction of multimodal learning, responsible for combining the information of multiple modalities for target prediction (classification or regression). It is one of the earliest research directions of MMML and currently the most widely applied. Multimodal fusion has other common aliases, such as Multi-source Information Fusion and Multi-sensor Fusion. The embodiments of this application involve two-modal fusion of the text modality and the picture modality in an article. Since the text modality can be divided into a title modality and a body-text modality, they can also involve three-modal fusion of the title modality, the body-text modality, and the picture modality.
  • High-quality image-text content: From the perspective of the article content itself, detecting high-quality articles that offer both content quality and a good reading experience helps the recommendation side better understand and apply the articles (that is, image-text content) published by the content center.
  • When comprehensively evaluating the content quality of an article, modeling can be done separately along dimensions such as multimodal fusion of images and text, article typesetting experience, and account atomic features, to finally complete the identification of high-quality articles.
  • Relative Position Embedding: A position encoding method used in the Transformer model. The Transformer model has two ways of encoding position: absolute position encoding and relative position encoding.
  • Absolute position encoding is currently the commonly used method: a position vector (Position Embedding) is randomly initialized for each character position, added to the input character vector (Word Embedding) sequence, fed into the model, and trained as a parameter.
  • With absolute position encoding, characters at different positions have different position vectors, but the relative relationship between positions cannot be obtained explicitly. For example, position 1 and position 2 are closer to each other than position 3 and position 10, and position 1 and position 2 differ by exactly 1, as do position 3 and position 4. Such relative relationships between positions can only be learned implicitly.
  • Introducing relative position encoding can enhance the feature representation of the relative relationship between positions.
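  • The contrast between the two encodings can be illustrated with a small sketch. It assumes a toy setting: the absolute position table is a randomly initialized list that a real model would go on to train, and the relative scheme is represented only by its index matrix of clipped signed offsets (the clipping distance `max_distance` is an assumption borrowed from common relative-position schemes, not something the text specifies). Function names are hypothetical.

```python
import random

def absolute_position_inputs(word_embeddings, position_table):
    """Add the position vector for index i to the i-th word vector."""
    return [
        [w + p for w, p in zip(word, position_table[i])]
        for i, word in enumerate(word_embeddings)
    ]

def relative_position_matrix(length, max_distance):
    """Entry [i][j] is the signed offset j - i, clipped to [-max_distance, max_distance]."""
    return [
        [max(-max_distance, min(max_distance, j - i)) for j in range(length)]
        for i in range(length)
    ]

random.seed(0)
dim, length = 4, 5
words = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(length)]
# Randomly initialized position vectors; in training these would be learned parameters.
table = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(length)]
inputs = absolute_position_inputs(words, table)

rel = relative_position_matrix(length, max_distance=2)
# Pairs (0, 1) and (2, 3) are both one step apart, so they share one relative index.
print(rel[0][1], rel[2][3], rel[0][4])  # 1 1 2
```

  • In the absolute scheme nothing ties `table[0]` and `table[1]` to `table[2]` and `table[3]`; in the relative scheme both pairs map to the same index, which makes the "one step apart" relationship explicit.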
  • FIG. 1 is a schematic diagram of an implementation environment of a data processing method provided by an embodiment of the present application.
  • The implementation environment includes a terminal 110 and a server 120, both of which are examples of computer devices.
  • The terminal 110 is used to support users in browsing various articles that include image-text content.
  • The articles include but are not limited to web page information, official account posts, blogs, microblogs, etc. The embodiments of this application do not specifically limit the type of article.
  • An application that supports browsing articles is installed and running on the terminal 110. The application can be a browser application, a social application, an image-text information application, a news viewing application, etc. The embodiments of the present application do not specifically limit the type of the application.
  • The user starts the application on the terminal 110, through which the high-quality articles pushed by the server 120 can be browsed.
  • The terminal 110 may be a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, etc., but is not limited thereto.
  • The terminal 110 and the server 120 may be connected directly or indirectly through wired or wireless communication, which is not limited in this application.
  • The server 120 is used to identify and push high-quality articles; that is, the server 120 provides background services to the application installed on the terminal 110.
  • The server 120 collects articles published by creators on the platform, extracts the titles, body texts, and pictures from the articles, and judges whether each article is a high-quality article according to the corresponding title features, body-text features, and picture features. In the recommendation stage, the recommendation weight of the identified high-quality articles is increased, so that high-quality articles are more likely to be pushed to the terminal 110 used by the user.
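  • As a rough illustration of raising the recommendation weight in the recommendation stage, the sketch below multiplies the base weight of articles flagged as high quality by a fixed factor before ranking. The multiplicative form, the factor `quality_boost=1.5`, and the tuple layout are assumptions made for the example; the text only states that the weight is increased.

```python
def rank_articles(candidates, quality_boost=1.5):
    """candidates: list of (article_id, base_weight, is_high_quality) tuples."""
    scored = [
        (article_id, base * quality_boost if is_high_quality else base)
        for article_id, base, is_high_quality in candidates
    ]
    # Highest recommendation weight first.
    return sorted(scored, key=lambda item: item[1], reverse=True)

ranked = rank_articles([("a", 0.50, False), ("b", 0.40, True), ("c", 0.45, False)])
print([article_id for article_id, _ in ranked])  # ['b', 'a', 'c']
```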
  • The server 120 includes at least one of a single server, multiple servers, a cloud computing platform, or a virtualization center.
  • The server 120 undertakes the main computing work and the terminal 110 undertakes the secondary computing work; or the server 120 undertakes the secondary computing work and the terminal 110 undertakes the main computing work; or the terminal 110 and the server 120 perform collaborative computing using a distributed computing architecture.
  • The server can be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, CDN (Content Delivery Network), and big data and artificial intelligence platforms.
  • the number of the foregoing terminals 110 may be more or less. For example, there may be only one terminal 110, or there may be tens or hundreds of terminals 110, or more. The embodiment of the present application does not limit the number and device types of terminals 110 .
  • FIG. 2 is a flow chart of a data processing method provided by an embodiment of the present application. Referring to FIG. 2, this embodiment is executed by a computer device and is described using a server as an example of the computer device. The embodiment includes the following steps:
  • The server acquires text features and picture features of an article, where the text features are used to characterize the text data in the article and the picture features are used to characterize the picture data in the article.
  • The above-mentioned article refers to any article whose category is to be determined.
  • The types of article include but are not limited to web page information, official account posts, blogs, microblogs, etc. The embodiments of the present application do not specifically limit the type of the article.
  • The server is an example of a computer device and includes at least one of a single server, multiple servers, a cloud computing platform, or a virtualization center.
  • This embodiment is described using the server as an example of the computer device; that is, identification of the article category is completed on the server side, for example, identifying whether the article is a high-quality article.
  • The step of identifying the article category can also be deployed on the terminal side; for example, the terminal independently identifies whether the article is a high-quality article.
  • The server obtains the article.
  • The article can be an article stored in an article database, the latest article uploaded by the terminal to the server, or an article obtained from a distributed file system.
  • The embodiments of the present application do not specifically limit the source of the article.
  • The picture quality of two articles affects whether each of them is judged to be a high-quality article.
  • Therefore, in addition to the text modality, the picture modality (that is, the visual modality) is also introduced, and multimodal data is used comprehensively to identify high-quality articles accurately.
  • The server extracts the text data and the picture data in the article separately.
  • This process can be regarded as extracting the multimodal data in the article.
  • The article can be divided into only two modalities, text and pictures, which reduces the computational complexity of multimodal fusion.
  • The server can extract the text semantic features of the text data and the picture depth features of the picture data, obtain the text features by fusing the text semantic features with text position features, and obtain the picture features by fusing the picture depth features with picture position features.
  • When extracting the text data, the server can further extract the title data and the body-text data separately, so as to introduce more and richer feature information. That is, the article is divided into three modalities (title, body text, and picture) to improve the overall recognition accuracy of the article category.
  • The server extracts the title semantic features of the title data, the body-text semantic features of the body-text data, and the picture depth features of the picture data. It then obtains the title features by fusing the title semantic features with title position features, the body-text features by fusing the body-text semantic features with body-text position features, and the picture features by fusing the picture depth features with picture position features.
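  • The two-modality and three-modality splits, and the fusion of semantic (or depth) features with position features, might look like the sketch below. Several choices are assumptions made for illustration only: the two-modality text sequence is formed by concatenating title and body text, fusion is element-wise addition, and `toy_positions` stands in for real position features. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List

Vector = List[float]

@dataclass
class Article:
    title: List[Vector]     # semantic features of the title
    body: List[Vector]      # semantic features of the body text
    pictures: List[Vector]  # depth features of the pictures

def split_modalities(article: Article, three_modal: bool) -> Dict[str, List[Vector]]:
    """Return the per-modality feature sequences for the chosen split."""
    if three_modal:
        return {"title": article.title, "body": article.body, "picture": article.pictures}
    # Assumed: the two-modality text sequence is title followed by body text.
    return {"text": article.title + article.body, "picture": article.pictures}

def add_position(semantic: List[Vector], position: List[Vector]) -> List[Vector]:
    """Assumed fusion: element-wise sum of semantic and position features."""
    return [[s + p for s, p in zip(sv, pv)] for sv, pv in zip(semantic, position)]

def toy_positions(count: int, dim: int) -> List[Vector]:
    """Placeholder position features: index i maps to a constant vector 0.1 * i."""
    return [[0.1 * i] * dim for i in range(count)]

article = Article(title=[[1.0, 0.0]], body=[[0.0, 1.0], [1.0, 1.0]], pictures=[[0.5, 0.5]])
features = {
    name: add_position(semantic, toy_positions(len(semantic), 2))
    for name, semantic in split_modalities(article, three_modal=True).items()
}
print(sorted(features))  # ['body', 'picture', 'title']
```

  • Switching `three_modal` to `False` yields only `text` and `picture`, which mirrors the lower-complexity two-modality option described earlier.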
  • The above position features, such as the text position features, picture position features, title position features, and body-text position features, can all be absolute position features obtained using absolute position encoding, which simplifies the training process of multimodal fusion.
  • Alternatively, the above position features, such as the text position features, picture position features, title position features, and body-text position features, can all be relative position features obtained using relative position encoding, and the relative position features are used to represent the corresponding text.
  • The relative position encoding method for the relative position features will be described in detail later and is not elaborated here.
  • For the text features, the server determines a first interaction feature based on the partial features of the picture features that are associated with the text features. The first interaction feature is used to characterize the text features fused with the picture features.
  • The server can use the text modality as the main modality and receive auxiliary information from the picture modality. For example, it obtains the partial features of the picture features that are associated with the text features and obtains the first interaction feature on this basis.
  • In the three-modality case of title, body text, and picture, the first interaction feature includes a title interaction feature and a body-text interaction feature.
  • The server can take the title modality as the main modality and receive auxiliary information from the body-text modality and the picture modality respectively. For example, it obtains the partial features of the body-text features and of the picture features that are associated with the title features, and obtains the title interaction feature on this basis. Likewise, with the body-text modality as the main modality, it receives auxiliary information from the title modality and the picture modality respectively. For example, it obtains the partial features of the title features and of the picture features that are associated with the body-text features, and obtains the body-text interaction feature on this basis.
  • For the picture features, the server determines a second interaction feature based on the partial features of the text features that are associated with the picture features. The second interaction feature is used to characterize the picture features fused with the text features.
  • The server can take the picture modality as the main modality and receive auxiliary information from the text modality. For example, it obtains the partial features of the text features that are associated with the picture features and obtains the second interaction feature on this basis.
  • In the three-modality case, the server uses the picture modality as the main modality and receives auxiliary information from the title modality and the body-text modality respectively. For example, it obtains the partial features of the title features and of the body-text features that are associated with the picture features, and obtains the second interaction feature on this basis.
  • the server fuses the first interaction feature with the second interaction feature to obtain a cross-modal interaction feature.
  • the cross-modal interaction feature obtained in the above step 204 refers to the information obtained by fusing the features of each modality in the multi-modal data with reference to the features of the other modalities, so that the cross-modal interaction feature can strengthen the association between the text data and the picture data. Each modality in turn serves as the main modality and receives the assistance of the other modalities to obtain a cross-modal feature, and the cross-modal features corresponding to the modalities are fused to obtain the final cross-modal interaction feature.
  • the server can directly fuse the first interaction feature and the second interaction feature to obtain the final cross-modal interaction feature.
  • the text modality and the picture modality are expressed differently, so there may be intersection (i.e., information redundancy) or complementarity (i.e., compared with the single-modal feature) between the text feature and the picture feature.
  • the salient characteristics of multi-modal data are redundancy and complementarity, and there may even be a variety of different information interactions between modalities. Therefore, extracting the first interaction feature with the text modality as the main modality and the second interaction feature with the picture modality as the main modality processes the multi-modal data in the article reasonably and yields richer interaction features.
  • the server may fuse the title interaction feature, the body-text interaction feature, and the second interaction feature to obtain the final cross-modal interaction feature.
  • the server determines the article category to which the article belongs based on the cross-modal interaction feature.
  • the server performs full-connection processing on the cross-modal interaction feature to obtain a fully connected feature, and performs exponential normalization on the fully connected feature to obtain the probability prediction result of the article. The probability prediction result includes multiple predicted probabilities in one-to-one correspondence with multiple categories, that is, it represents the predicted probabilities that the article belongs to each of the categories. The category corresponding to the predicted probability that meets the target condition is then determined as the article category to which the article belongs.
  • the server inputs the cross-modal interaction feature into a fully connected layer or a fully connected network, which outputs the fully connected feature, and then uses the exponential normalization (Softmax) function to map the fully connected feature to the predicted probability that the article belongs to each category. Further, the server selects, from all the predicted probabilities, the one that meets the target condition and determines the corresponding category as the article category to which the article belongs.
  • the target condition may be that the prediction probability is the largest, then the server may determine the maximum prediction probability from the plurality of prediction probabilities, and determine the category corresponding to the maximum prediction probability as the article category to which the article belongs. Alternatively, the server may sort the plurality of predicted probabilities in descending order, and select the category corresponding to the predicted probability ranked first in the ranking as the article category to which the article belongs.
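  • the classification step above can be sketched as follows. This is a minimal illustration in plain Python, not the implementation of the embodiment: the function names, the toy weights, and the two-category label list are assumptions made for the example, and a real system would use learned parameters.

```python
import math

def fully_connected(features, weights, bias):
    """Map a feature vector to one score (logit) per category."""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, bias)]

def softmax(logits):
    """Exponential normalization; subtracting the max keeps exp() stable."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_category(features, weights, bias, categories):
    """Return the category with the largest predicted probability."""
    probs = softmax(fully_connected(features, weights, bias))
    best = max(range(len(probs)), key=probs.__getitem__)
    return categories[best], probs

# Hypothetical toy values, for illustration only.
CATEGORIES = ["high-quality", "non-high-quality"]
WEIGHTS = [[1.0, 0.5], [-1.0, 0.25]]
BIAS = [0.0, 0.0]

category, probs = predict_category([2.0, 1.0], WEIGHTS, BIAS, CATEGORIES)
```

  • here the target condition is "largest predicted probability"; the other target conditions described in this embodiment only change how the final category is picked from `probs`.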
  • the target condition can be that the predicted probability is greater than a probability threshold. In that case, the server can determine, from the plurality of predicted probabilities, each predicted probability greater than the probability threshold, and randomly select one category from the categories corresponding to those predicted probabilities as the article category to which the article belongs.
  • the predicted probability is any numerical value greater than or equal to 0 and less than or equal to 1.
  • the target condition can be that the predicted probability ranks in the top K (K ≥ 1). In that case, the server can sort the multiple predicted probabilities in descending order, select the K predicted probabilities ranked highest, and randomly select one category from the K categories corresponding to them as the article category to which the article belongs.
  • K is an integer greater than or equal to 1.
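  • the threshold and top-K target conditions can be sketched as below. The helper names, the example probabilities, and the threshold of 0.3 are hypothetical; the embodiment only specifies that one category is chosen at random among the candidates that meet the condition.

```python
import random

def candidates_above_threshold(probs, threshold):
    """Indices of categories whose predicted probability exceeds the threshold."""
    return [i for i, p in enumerate(probs) if p > threshold]

def candidates_top_k(probs, k):
    """Indices of the K categories with the largest predicted probabilities."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return order[:k]

def pick_category(candidates, categories, rng):
    """Randomly pick one candidate category; None if there are no candidates."""
    if not candidates:
        return None
    return categories[rng.choice(candidates)]

# Hypothetical example values.
CATEGORIES = ["finance", "entertainment", "news", "popular science"]
PROBS = [0.45, 0.05, 0.40, 0.10]
rng = random.Random(0)  # seeded only to make the example repeatable

above = candidates_above_threshold(PROBS, 0.3)
top2 = candidates_top_k(PROBS, 2)
chosen = pick_category(top2, CATEGORIES, rng)
```

  • with these values both conditions keep the same two candidates (finance and news), and the final category is drawn at random from them.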
  • the article category to which the article belongs can be identified.
  • the article categories can be divided according to whether the article is a high-quality article, for example, into high-quality articles, non-high-quality articles, and so on; the article categories can also be divided according to the field to which the main content of the article belongs, for example, into finance, entertainment, news, popular science, and so on. The embodiment of this application does not specifically limit how the article categories are divided.
  • the article categories are divided according to whether the article is a high-quality article, which applies to the scenario of identifying high-quality articles, that is, identifying high-quality graphic-text content. For example, the article categories are divided into high-quality articles and non-high-quality articles, or into high-quality articles, ordinary articles, low-quality articles, and so on; the embodiment of the present application does not specifically limit how the article categories are divided.
  • the cross-modal interaction information between adjacent texts is very important. The text in an article is usually a sequence of characters or a sequence of sentences, and the pictures can also be arranged in order into a picture sequence, so the text modality and the picture modality can interact at the sequence level. By constructing a sequence-level multi-modal fusion network, the sequence-level interaction information between modalities can be fully used even when the respective features of the text and the pictures are not aligned; the interaction features between the modalities are extracted and fed into the prediction of the article category, which improves the recognition accuracy of article categories.
  • the article categories are divided according to the field to which the main content of the article belongs, which applies to the scenario of precise pushing based on user portraits. For example, the article categories are divided into finance, entertainment, news, popular science, and so on.
  • the server determines whether to recommend the article to the user based on the similarity between the cross-modal interaction feature of the article and the user feature of the user, so that articles matching the user's long-term preferences can be recommended. Alternatively, the server determines whether to recommend the article to the user based on the similarity between the cross-modal interaction feature of the article and the user's historical reading average feature, where the historical reading average feature refers to the average of the cross-modal interaction features of the historical articles read by the user in the last week (or within another specified period, such as one month or two months), so that articles matching the user's recent preferences can be recommended.
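  • a minimal sketch of the recommendation decision is given below. The embodiment does not name a similarity measure or a decision rule, so cosine similarity and the threshold of 0.8 are assumptions made for illustration, as are the function names and the toy feature vectors.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def mean_feature(features):
    """Element-wise average of the features of recently read articles."""
    count = len(features)
    return [sum(column) / count for column in zip(*features)]

def should_recommend(article_feature, reference_feature, threshold=0.8):
    """Recommend when the similarity reaches the (assumed) threshold."""
    return cosine_similarity(article_feature, reference_feature) >= threshold

# Hypothetical cross-modal interaction features of articles read in the last week.
recent_reads = [[1.0, 0.0, 1.0], [1.0, 0.2, 0.8]]
history_avg = mean_feature(recent_reads)

similar = should_recommend([0.9, 0.1, 1.0], history_avg)
dissimilar = should_recommend([0.0, 1.0, 0.0], history_avg)
```

  • replacing `history_avg` with a long-term user feature gives the first variant described above, while the averaged recent-reading feature gives the second.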
  • the method provided in the embodiment of the present application extracts text features and picture features from the text data and picture data of the article respectively, and uses the cross-modal interaction feature between the two to predict the article category to which the article belongs. It thereby considers the contributions of both the text modality and the picture modality to the article category, instead of judging only from the text.
  • the extracted cross-modal interaction feature is not a simple splicing of the text feature and the picture feature. It reflects richer and deeper interaction information between modalities, which helps improve the recognition accuracy of article categories and, in the scenario of identifying high-quality articles, the accuracy of mining high-quality articles.
  • Fig. 3 is a flow chart of a data processing method provided by an embodiment of the present application. Referring to Fig. 3, this embodiment is executed by a computer device and is described by taking the computer device being a server as an example. For the case where the article is divided into only two modalities, text and pictures, this embodiment describes in detail how to identify the article category of the article through multi-modal fusion, and includes the following steps:
  • the server acquires text data and picture data in the article.
  • an article refers to any article whose category is to be determined.
  • the type of the article includes but is not limited to: webpage information, official account tweets, blogs, microblogs, etc.
  • the embodiment of the present application does not specifically limit the type of the article.
  • the server obtains the article. Optionally, the article is an article stored in an article database, the latest article uploaded by a terminal to the server, or an article downloaded from a distributed file system; the embodiment of this application does not specifically limit the source of the article.
  • the server extracts text data and picture data in the article respectively.
  • the above process can be regarded as an extraction process for multi-modal data in the article.
  • the article is divided into only the two modalities of text and pictures, which reduces the computational complexity of multi-modal fusion.
  • the server extracts text semantic features of the text data, and fuses the text semantic features with text position features to obtain text features of the text data.
  • the server may extract the text semantic feature based on a text encoding model, which is used to extract the text semantic feature of the text data. That is, the server inputs the text data into the text encoding model and encodes the text data through the text encoding model to obtain the text semantic feature.
  • the model structure of the text encoding model includes but is not limited to any one or a combination of at least two of the following: BERT (Bidirectional Encoder Representations from Transformers, a language model using bidirectional encoder representations), Transformers (a classic translation model), ELMo (Embeddings from Language Models, a language model using embeddings), NNLM (Neural Network Language Model), and so on. The embodiment of this application does not specifically limit the model structure of the text encoding model.
  • the text encoding model is a BERT model to reduce the computational complexity of the feature extraction process.
  • the text encoding model is formed by cascading the BERT model and the encoder (Encoder) of the Transformers model.
  • the text encoding model is illustrated by cascading the BERT model and the encoder of the Transformers model as an example.
  • the text data includes at least one sentence. The server performs word segmentation on each sentence to obtain at least one character contained in each sentence, and the characters of the sentences are arranged in the order in which they appear in the article to form a character sequence. In this character sequence, [SEP] is added at the end of each sentence as a sentence separator, and [CLS] is added at the beginning of the character sequence as a classifier, where the sentence separator is used to mark the boundary between adjacent sentences and the classifier is used to represent the global semantic information of the entire character sequence.
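  • the construction of the character sequence can be illustrated with the short sketch below. It assumes the sentences have already been segmented into characters or tokens; the function name and the sample sentences are hypothetical.

```python
def build_character_sequence(sentences):
    """Build one sequence: [CLS] first, then each sentence followed by [SEP].

    `sentences` is a list of sentences, each already split into tokens.
    """
    sequence = ["[CLS]"]
    for tokens in sentences:
        sequence.extend(tokens)
        sequence.append("[SEP]")
    return sequence

# Hypothetical, already-segmented sentences in article order.
sentences = [["good", "morning"], ["nice", "article"]]
sequence = build_character_sequence(sentences)
# ['[CLS]', 'good', 'morning', '[SEP]', 'nice', 'article', '[SEP]']
```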
  • the BERT model includes an embedding layer and at least one bidirectional encoding layer. Each bidirectional encoding layer performs forward encoding and reverse encoding on its input signal, and the output of each bidirectional encoding layer is used as the input of the next one, that is, the bidirectional encoding layers are connected in series.
  • each bidirectional encoding layer consists of two parts: an attention network and a feed-forward fully connected layer.
  • each hidden layer in the attention network is obtained as a weighted average of the hidden layers of the previous layer, so that each hidden layer is directly related to all hidden layers of the previous layer, and a hidden-layer vector representing global information can be obtained from the input long sequence (that is, the character sequence). The feed-forward fully connected layer further processes the global information determined by the attention network to enhance the learning ability of the entire BERT model.
  • the character sequence is first input into the embedding layer of the BERT model, and each character in the character sequence is embedded through the embedding layer, that is, each character is mapped to the embedding space, and the embedding vector of each character is obtained, that is, A sequence of embedding vectors. Then, input the embedding vector sequence into the at least one bidirectional encoding layer, perform bidirectional encoding (including forward encoding and reverse encoding) on each embedding vector in the embedding vector sequence through the at least one bidirectional encoding layer, and output each The semantic vector of the character, that is, a sequence of semantic vectors is obtained.
  • each character in the character sequence corresponds to an embedding vector in the embedding vector sequence
  • each embedding vector in the embedding vector sequence corresponds to a semantic vector in the semantic vector sequence.
  • the embedding vector sequence is forward-encoded and reverse-encoded through the bidirectional encoding layer. Forward encoding lets the semantic vector of each character fuse the relevant information of the characters that appear before it, and reverse encoding lets the semantic vector of each character fuse the relevant information of the characters that appear after it; encoding in both directions greatly improves the expressive ability of the semantic vector of each character.
  • the attention feature sequence is bidirectionally encoded (including forward encoding and reverse encoding) through the feed-forward fully connected layer, and a hidden vector sequence is output.
  • the hidden vector sequence is input into the second bidirectional encoding layer, and so on; the processing logic of each subsequent bidirectional encoding layer is similar to that of the first and is not repeated here. Because the attention mechanism is introduced in the bidirectional encoding layer, each character can focus, each time semantic encoding is performed, on the characters more closely related to itself, so that the semantic vector finally obtained for each character has higher accuracy.
  • the Transformers model includes multiple cascaded encoders, for example, N (N ≥ 1) cascaded encoders, such as N = 6 or another number; the embodiment of the present application does not specifically limit the number of encoders.
  • each encoder includes a multi-head attention (Multi-Head Attention) layer and a feed-forward neural network (Feed-Forward Neural Network) layer. The multi-head attention layer is used to comprehensively extract, from multiple representation subspaces, the association relationships between the characters in the character sequence, and the feed-forward neural network layer is used to fully connect the feature vectors output by the multi-head attention layer. A residual structure is set after both the multi-head attention layer and the feed-forward neural network layer, that is, the input and output of the current layer are residually connected (that is, spliced) and normalized before being input to the next layer.
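  • the residual structure followed by normalization can be sketched as below. The sketch assumes that the residual connection is the element-wise addition used in the standard Transformer encoder (the same operation this document later attributes to the residual connections of the cross-modal layers) and that the normalization is layer normalization without learned scale and shift; both points, and all function names, are assumptions for illustration.

```python
import math

def layer_norm(vector, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(vector) / len(vector)
    variance = sum((x - mean) ** 2 for x in vector) / len(vector)
    return [(x - mean) / math.sqrt(variance + eps) for x in vector]

def add_and_norm(layer_input, layer_output):
    """Residual connection (element-wise addition) followed by normalization."""
    summed = [x + y for x, y in zip(layer_input, layer_output)]
    return layer_norm(summed)

def encoder_sublayer(vector, sublayer):
    """Apply a sub-layer (attention or feed-forward), then add and normalize."""
    return add_and_norm(vector, sublayer(vector))

# A stand-in sub-layer that simply halves its input; hypothetical.
out = encoder_sublayer([1.0, 2.0, 3.0, 4.0], lambda v: [0.5 * x for x in v])
```

  • in a full encoder this wrapper would be applied twice per layer, once around the multi-head attention layer and once around the feed-forward neural network layer.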
  • the input semantic vector sequence is encoded by multiple encoders of the Transformers model, and the last encoder outputs the text semantic features of the text data.
  • the server may also obtain the text position feature of the text data, where the text position feature is used to characterize the position order of each character in the text data.
  • the position information of each character in the character sequence is encoded to obtain the text position feature of the text data.
  • the text semantic feature and the text position feature are spliced (Concat) to obtain the text feature of the text data.
  • when the server encodes the position information of each character, it may adopt an absolute position encoding method or a relative position encoding method; the embodiment of the present application does not specifically limit the encoding method of the position information.
  • the relative position encoding method is described later as an example, and details are not given here.
  • a 1-dimensional convolutional layer can be used to transform the dimension of the text semantic feature (that is, increase or decrease its dimension) so that the transformed text semantic feature has the same dimension as the text position feature; the transformed text semantic feature and the text position feature are then spliced to obtain the text feature of the text data.
  • the 1-dimensional convolutional layer refers to a convolutional layer with a convolution kernel size of 1×1.
  • in addition to fusion by splicing, methods such as element-wise addition, element-wise multiplication, and bilinear fusion can also be used to fuse the text semantic feature and the text position feature; the embodiment of the present application does not specifically limit the fusion method.
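  • the dimension transformation and three of the fusion options can be sketched for a single sequence position as follows. A 1×1 convolution applies the same linear map at every position, so it is modeled here as one matrix-vector product; the projection weights, feature values, and function names are hypothetical, and bilinear fusion is omitted for brevity.

```python
def project(vector, weights):
    """Linear map applied at one position; a 1x1 convolution repeats this
    same map at every position of the sequence."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]

def fuse(semantic, position, method="concat"):
    """Fuse a semantic feature with a position feature of the same dimension."""
    if method == "concat":
        return semantic + position  # splicing: list concatenation
    if len(semantic) != len(position):
        raise ValueError("element-wise fusion needs equal dimensions")
    if method == "add":
        return [s + p for s, p in zip(semantic, position)]
    if method == "multiply":
        return [s * p for s, p in zip(semantic, position)]
    raise ValueError(f"unknown fusion method: {method}")

# Hypothetical 3-dimensional semantic feature reduced to 2 dimensions.
PROJECTION = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
semantic = project([0.5, 1.0, 2.0], PROJECTION)   # -> 2 dimensions
position = [0.1, 0.2]

spliced = fuse(semantic, position, "concat")
added = fuse(semantic, position, "add")
multiplied = fuse(semantic, position, "multiply")
```

  • splicing doubles the feature dimension, while element-wise addition and multiplication keep it unchanged, which is one practical difference between the options.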
  • the server extracts the picture depth feature of the picture data, and fuses the picture depth feature with the picture position feature to obtain the picture feature of the picture data.
  • the server may extract the picture depth feature based on a picture depth model, which is used to extract the picture depth feature of the picture data. That is, the server inputs the picture data into the picture depth model and performs convolution processing on the picture data through the picture depth model to extract the picture depth feature.
  • the picture depth model includes but is not limited to: a convolutional neural network (CNN), a deep residual network (ResNet), MobileNet (a lightweight neural network), and so on; the embodiment of the present application does not specifically limit the model structure of the picture depth model.
  • the picture depth model can be a MobileNet model.
  • the MobileNet model refers to replacing the standard convolution layer in the VGG (Visual Geometry Group) model with a depthwise separable convolution (Depthwise Separable Convolution) layer.
  • the depthwise separable convolution is a factorizable convolution operation that can be decomposed into a depthwise convolution (Depthwise Convolution) and a pointwise convolution (Pointwise Convolution). Depthwise convolution differs from standard convolution: the convolution kernel of a standard convolution is applied across all channels of the input feature map, whereas depthwise convolution uses a different convolution kernel for each input channel, that is, one convolution kernel corresponds to one input channel. Pointwise convolution is a 1-dimensional convolution, that is, a standard convolution with a convolution kernel size of 1×1.
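  • one way to see why this decomposition makes the model lightweight is to count weights. The sketch below compares a standard convolution with its depthwise separable counterpart; bias terms are ignored, and the channel counts (32 input, 64 output) are hypothetical values chosen for the example.

```python
def standard_conv_params(kernel, in_channels, out_channels):
    """Weights of a standard convolution: every output channel has one
    kernel x kernel filter spanning all input channels."""
    return kernel * kernel * in_channels * out_channels

def depthwise_separable_params(kernel, in_channels, out_channels):
    """Weights of a depthwise convolution plus a 1x1 pointwise convolution."""
    depthwise = kernel * kernel * in_channels   # one kernel per input channel
    pointwise = in_channels * out_channels      # 1x1 kernels mixing channels
    return depthwise + pointwise

std = standard_conv_params(3, 32, 64)         # 3*3*32*64 = 18432
sep = depthwise_separable_params(3, 32, 64)   # 3*3*32 + 32*64 = 2336
```

  • for this 3×3 example the separable form needs roughly one eighth of the weights of the standard convolution.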
  • the MobileNet model includes a 3×3 standard convolutional layer, multiple stacked depthwise separable convolutional layers, a mean pooling layer, and a fully connected layer. Downsampling can be performed after the standard convolutional layer before the result is input to the next layer, and similarly after some of the depthwise separable convolutional layers. The mean pooling layer performs mean pooling on the feature map output by the last depthwise separable convolutional layer, and the fully connected layer fully connects the feature map output by the mean pooling layer.
  • the MobileNet model includes 28 layers in total, of which 13 are depthwise separable convolutional layers.
  • the multiple pictures can constitute a picture sequence. The server inputs the picture sequence into the MobileNet model and performs a standard convolution operation on the picture sequence through the standard convolutional layer to obtain a first feature map. The first feature map is input into the multiple cascaded depthwise separable convolutional layers, each of which performs a depthwise separable convolution operation on the feature map output by the previous layer, and the last depthwise separable convolutional layer outputs a second feature map. The second feature map is input into the mean pooling layer, which performs a mean pooling operation on it to obtain a third feature map. The third feature map is input into the fully connected layer, which fully connects it to obtain the picture depth feature.
  • the server may also obtain a picture position feature of the picture data, where the picture position feature is used to represent the sequence of positions of each picture in the picture data.
  • the position information of each picture in the picture sequence is encoded to obtain the picture position feature of the picture data.
  • concatenate (Concat) the depth feature of the picture and the position feature of the picture to obtain the picture feature of the picture data.
  • when the server encodes the position information of each picture, it may adopt an absolute position encoding method or a relative position encoding method; the embodiment of the present application does not specifically limit the encoding method of the position information.
  • the relative position encoding method is used as an example for illustration.
  • the text position feature and the picture position feature are both relative position features between the text data and the picture data, and the relative position feature is used to represent the order and distance between the text data and the picture data.
  • the relative position feature is obtained as follows: determine the position information, in the article, of each of the multiple texts in the text data and each of the multiple pictures in the picture data; based on the position information, construct a relative position encoding matrix, where any element in the matrix represents the relative position information between the text corresponding to the column of the element and the picture corresponding to the row of the element; and, based on the relative position encoding matrix, determine the relative position feature between any text of the multiple texts and any picture of the multiple pictures.
  • FIG. 4 is a schematic diagram of the principle of extracting position information provided by the embodiment of the present application.
  • the article includes 5 paragraphs of text and 4 pictures, where the position numbers of the text sequence formed by the 5 paragraphs of text are {1, 3, 5, 7, 8} and the position numbers of the picture sequence formed by the 4 pictures are {0, 2, 4, 6}. The absolute position relationship extracted for the article can then be expressed as: pos-0(img), pos-1(text), pos-2(img), pos-3(text), pos-4(img), pos-5(text), pos-6(img), pos-7(text), pos-8(text).
  • with the text sequence as the matrix columns and the picture sequence as the matrix rows, the relative position encoding matrix shown in Table 1 below is constructed:
  • each element in the relative position encoding matrix is used to represent the relative position information between the text corresponding to the column to which the element belongs and the picture corresponding to the row to which the element belongs.
  • the relative position information between each text and each picture can be determined through the relative position encoding matrix, and the corresponding relative position feature can be obtained by encoding the relative position information.
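  • a relative position encoding matrix for the example article can be sketched as follows. The position numbers come from the example above; the rule used for each element (text position minus picture position, so that the sign gives the order and the magnitude gives the distance) is an assumption for illustration and is not taken from Table 1.

```python
# Position numbers from the example article (5 texts, 4 pictures).
TEXT_POSITIONS = [1, 3, 5, 7, 8]
PICTURE_POSITIONS = [0, 2, 4, 6]

def relative_position_matrix(text_positions, picture_positions):
    """Rows correspond to pictures, columns to texts.

    Assumed rule: element = text position - picture position, so a positive
    value means the text appears after the picture and the absolute value
    is the distance between them.
    """
    return [[t - p for t in text_positions] for p in picture_positions]

matrix = relative_position_matrix(TEXT_POSITIONS, PICTURE_POSITIONS)
```

  • each element could then be mapped to a learned embedding to obtain the relative position feature between the corresponding text and picture.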
  • the text position features and picture position features determined by the traditional absolute position encoding method are all absolute position features, which make it possible to implicitly learn the positional correlation between different text paragraphs and between different article illustrations, fully considering the intra-modal positional relationships of the text sequence and the picture sequence.
  • a 1-dimensional convolutional layer can be used to transform the dimension of the picture depth feature (that is, increase or decrease its dimension) so that the transformed picture depth feature has the same dimension as the picture position feature; the transformed picture depth feature and the picture position feature are then spliced to obtain the picture feature of the picture data.
  • the 1-dimensional convolutional layer refers to a convolutional layer with a convolution kernel size of 1×1.
  • in addition to fusion by splicing, methods such as element-wise addition, element-wise multiplication, and bilinear fusion can also be used to fuse the picture depth feature and the picture position feature; the embodiment of the present application does not specifically limit the fusion method.
  • the above steps show a possible implementation in which the server obtains the text feature and the picture feature of the article, where the text feature is used to characterize the text data in the article and the picture feature is used to characterize the picture data in the article.
  • the case where the server obtains the title feature of the title data, the body-text feature of the body-text data, and the picture feature of the picture data under three-modal fusion of title, body text, and pictures is not repeated here.
  • step 302 may be executed before step 303, step 303 may be executed before step 302, or step 302 and step 303 may be executed at the same time; the embodiment of the present application does not limit the execution order of step 302 and step 303.
  • the server determines a first interaction feature based on some features associated with the text feature in the picture feature, where the first interaction feature is used to characterize the text feature combined with the picture feature.
  • the text modality is taken as the main modality and receives the auxiliary information of the picture modality. That is, the server obtains the partial features in the picture feature that are associated with the text feature, and uses a cross-modal interaction model to process the text feature and the partial features to obtain the first interaction feature.
  • the cross-modal interaction model includes, but is not limited to, a Transformers model or a variant of the Transformers model.
  • FIG. 5 is a schematic diagram of a cross-modal interaction model provided by an embodiment of the present application.
  • the following description takes the cross-modal interaction model being a cross-modal (Cross-modal) Transformers model as an example.
  • the cross-modal Transformers model includes D+1 (D ≥ 0) cross-modal interaction layers. Assuming that modality α is the main modality (such as the text modality) and modality β is the auxiliary modality (such as the picture modality), the interaction features from modality β to modality α (β→α) are extracted as follows:
  • where LayerNorm denotes layer normalization, Multi-Head denotes multi-head attention, and Addition denotes element-wise addition of the cross-modal features.
  • the final interaction feature (i.e., the first interaction feature) is output by the D-th layer.
  • the two element-wise addition operations are equivalent to performing residual connections on the respective input and output of the multi-head attention layer and the position fully connected feedforward network layer.
  • the cross-modal Transformers model receives the sequence features of the main modality and the auxiliary modality as input signals and, after the two stages of processing by the multi-head attention layer and the position-wise fully connected feed-forward network layer, outputs the main-modality representation fused with the auxiliary-modality information (i.e., the first interaction feature).
  • the multi-head attention layer can be regarded as a variant of the self-attention (Self-Attention) layer in which the K and V of the input features are replaced by the feature sequences K_β and V_β of the auxiliary modality β, while Q is the feature sequence Q_α of the main modality α. The main modality is thus used to select, from the auxiliary modality, the information that interacts with itself, so that the extracted first interaction feature has a stronger feature expression capability.
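  • the cross-modal attention described above can be sketched with a single attention head as follows. This is a simplified illustration rather than the model of the embodiment: it uses scaled dot-product attention, omits the learned Q/K/V projections and the multiple heads, and the toy text and picture features are hypothetical.

```python
import math

def softmax(scores):
    """Exponential normalization of a list of attention scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_modal_attention(main_seq, aux_seq):
    """Queries come from the main modality; keys and values come from the
    auxiliary modality. Returns one fused vector per main-modality position."""
    dim = len(main_seq[0])
    fused = []
    for query in main_seq:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in aux_seq]
        weights = softmax(scores)
        fused.append([sum(w * value[j] for w, value in zip(weights, aux_seq))
                      for j in range(dim)])
    return fused

# Hypothetical feature sequences: 3 text positions, 2 pictures, dimension 2.
TEXT = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
PICTURES = [[1.0, 0.0], [0.0, 1.0]]

text_main = cross_modal_attention(TEXT, PICTURES)      # picture -> text
picture_main = cross_modal_attention(PICTURES, TEXT)   # text -> picture
```

  • the output always has the length of the main-modality sequence, so swapping the two arguments yields a different result, which reflects the directional nature of the attention.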
  • the structure of the cross-modal Transformers model is versatile and flexible, and can be customized and combined according to the importance of each modality during model design. The cross-modal attention mechanism in the multi-head attention layer is directional, that is, for the same pair of input modalities {text, picture}, the interaction features extracted with text as the main modality differ from those extracted with the picture as the main modality. For example, step 304 extracts the first interaction feature with text as the main modality, and the following step 305 extracts the second interaction feature with the picture as the main modality. The first interaction feature differs from the second interaction feature, which helps the model make full use of the interaction information between modalities.
  • the cross-modal Transformers model stacks multiple cross-modal interaction layers and can therefore fuse more high-level interaction information than a traditional single-layer interaction scheme.
  • the first interaction feature is directly output by the cross-modal Transformers model, which reduces the computational complexity of obtaining the first interaction feature.
  • the features output by the cross-modal Transformers model can also be used as intermediate interaction features, which are then input into a basic Transformers model for encoding and decoding, and the basic Transformers model finally outputs the first interaction feature.
  • The server inputs the intermediate interaction features into the Transformers model, which includes N cascaded encoders and N cascaded decoders; it calls the N cascaded encoders to encode the intermediate interaction features, and inputs the encoded features into the N cascaded decoders for decoding to obtain the first interaction feature.
  • Each of the N cascaded encoders includes a multi-head attention layer and a feed-forward neural network layer.
  • The multi-head attention layer comprehensively extracts, from multiple representation subspaces, the correlations between the feature vectors at each moment, and the feed-forward neural network layer applies a full connection to the feature vectors output by the multi-head attention layer.
  • A residual structure is set after both the multi-head attention layer and the feed-forward neural network layer; that is, the input and output of the current layer are residually connected (that is, spliced) and normalized before being input to the next layer.
  • the input vector is encoded by N cascaded encoders, and the features output by the last encoder are input into N cascaded decoders.
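  • The cascade of encoders with a residual structure can be sketched as follows. This sketch assumes that the residual connection is the element-wise addition mentioned earlier for the cross-modal model, followed by layer normalization; the sublayers `mean_attention` and `relu_ffn` are hypothetical stand-ins so that the example runs without trained weights.

```python
import math

def layer_norm(vec, eps=1e-6):
    """Normalize one feature vector to zero mean and (approximately) unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def add_and_norm(layer_input, layer_output):
    """Residual connection (assumed element-wise addition) followed by normalization."""
    return [layer_norm([a + b for a, b in zip(x, y)])
            for x, y in zip(layer_input, layer_output)]

def encoder_layer(seq, attention, feed_forward):
    """One encoder: attention sublayer, then feed-forward sublayer, each with add-and-norm."""
    seq = add_and_norm(seq, attention(seq))
    return add_and_norm(seq, feed_forward(seq))

def run_encoders(seq, layers):
    """Cascade: the output of each encoder is the input of the next one."""
    for attention, feed_forward in layers:
        seq = encoder_layer(seq, attention, feed_forward)
    return seq

def mean_attention(seq):
    """Hypothetical stand-in: uniform attention, every position becomes the sequence average."""
    avg = [sum(col) / len(seq) for col in zip(*seq)]
    return [avg[:] for _ in seq]

def relu_ffn(seq):
    """Hypothetical stand-in for the fully connected feed-forward layer (position-wise ReLU)."""
    return [[max(0.0, v) for v in vec] for vec in seq]

N = 3  # number of cascaded encoders (assumed value)
features = [[1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 0.0]]
encoded = run_encoders(features, [(mean_attention, relu_ffn)] * N)
```

  • The sequence length and feature dimension are unchanged by the cascade, which is what allows the layers to be stacked N times.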
  • Each of the N cascaded decoders includes a masked multi-head attention layer, a fusion multi-head attention layer and a feed-forward neural network layer.
  • The masked multi-head attention layer is similar to the multi-head attention layer, but it attends only to the translation results before the current moment, so the translation results after the current moment need to be masked (occluded). The fusion multi-head attention layer is also similar to the multi-head attention layer, but it additionally takes as input the output of the feed-forward neural network layer of the encoder with the corresponding serial number (that is, the result after residual connection and normalization).
  • This design lets the decoder attend to the information encoded by the encoder.
  • In other words, the decoder predicts the interaction feature of the next moment by looking at the output of the encoder and applying self-attention to its own output.
  • The feed-forward neural network layer of the decoder is similar to that of the encoder and is not described again here.
  • A residual structure is likewise set after the masked multi-head attention layer, the fusion multi-head attention layer, and the feed-forward neural network layer of the decoder; that is, the input and output of the current layer are residually connected (that is, spliced) and normalized before being input to the next layer.
  • the number of cascaded encoders needs to be consistent with the number of cascaded decoders.
  • the encoded features can be decoded by N cascaded decoders, and the last decoder outputs the first interaction feature.
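  • The masking performed by the masked multi-head attention layer can be sketched as follows. This is a minimal single-head illustration that assumes the attention scores have already been computed; the function name and the toy score matrix are assumptions.

```python
import math

def masked_attention_weights(scores):
    """Softmax over each row after hiding the positions later than the current one."""
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]                   # positions 0..i stay visible
        m = max(visible)
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        # Masked (later) positions receive a weight of exactly 0.
        weights.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return weights

# Toy score matrix (assumed): row i holds the scores of position i against all positions.
scores = [[0.1, 0.9, 0.3],
          [0.4, 0.2, 0.8],
          [0.5, 0.5, 0.5]]
weights = masked_attention_weights(scores)
```

  • The first position can attend only to itself, while the last position attends to the whole sequence.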
  • The server determines a second interaction feature based on the partial features of the text feature that are associated with the picture feature, where the second interaction feature is used to characterize the picture feature combined with the text feature.
  • The picture mode is used as the main mode and receives the auxiliary information of the text mode; that is, the server obtains the partial features of the text feature that are associated with the picture feature, and uses the cross-modal interaction model to process the picture feature and these partial features to obtain the second interaction feature.
  • The cross-modal interaction model includes, but is not limited to, a Transformers model or a variant of the Transformers model.
  • The above step 305 is similar to the above step 304, except that the main mode α is changed to the picture mode and the auxiliary mode β is changed to the text mode, which is not described again here.
  • The second interaction feature may be output directly by the cross-modal Transformers model, which reduces the computational complexity of obtaining the second interaction feature.
  • the features output by the cross-modal Transformers model are used as intermediate interaction features, and the intermediate interaction features are input into a basic Transformers model for encoding and then decoding, and finally the basic Transformers model outputs the second interaction features.
  • The server inputs the intermediate interaction features into the Transformers model, which includes N cascaded encoders and N cascaded decoders; it calls the N cascaded encoders to encode the intermediate interaction features, and inputs the encoded features into the N cascaded decoders for decoding to obtain the second interaction feature.
  • the internal processing logic of each encoder and decoder in the basic Transformers model has been introduced in step 304 above, and will not be repeated here.
  • Step 304 may be executed before step 305, step 305 may be executed before step 304, or the two steps may be executed simultaneously; this application does not limit the execution order of step 304 and step 305.
  • the server fuses the first interaction feature with the second interaction feature to obtain a cross-modal interaction feature.
  • the server may concatenate the first interaction feature and the second interaction feature to obtain the final cross-modal interaction feature, thereby reducing the amount of calculation during feature fusion.
  • Alternatively, the server may perform element-wise addition, element-wise multiplication, or bilinear fusion of the first interaction feature and the second interaction feature, so that the features are more fully fused.
  • This embodiment of the present application does not specifically limit the feature fusion method.
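  • The fusion options named above can be compared on toy vectors. The following is a minimal sketch; the bilinear form used here (one weight matrix per output element) is a common definition adopted as an assumption, and all function names and values are illustrative.

```python
def concat(a, b):
    """Concatenation (splicing): the output dimension is len(a) + len(b)."""
    return a + b

def add(a, b):
    """Element-wise addition: both inputs must share one dimension."""
    return [x + y for x, y in zip(a, b)]

def multiply(a, b):
    """Element-wise multiplication: both inputs must share one dimension."""
    return [x * y for x, y in zip(a, b)]

def bilinear(a, b, w):
    """Bilinear fusion (assumed form): out[k] = sum over i, j of a[i] * w[k][i][j] * b[j]."""
    return [sum(a[i] * w_k[i][j] * b[j]
                for i in range(len(a)) for j in range(len(b)))
            for w_k in w]

first = [1.0, 2.0]    # toy first interaction feature
second = [3.0, 4.0]   # toy second interaction feature
w = [[[1.0, 0.0], [0.0, 1.0]],   # output 0: dot product of the two features
     [[1.0, 1.0], [1.0, 1.0]]]   # output 1: product of the two feature sums

fused_concat = concat(first, second)
fused_add = add(first, second)
fused_mul = multiply(first, second)
fused_bilinear = bilinear(first, second, w)
```

  • Concatenation needs no arithmetic and no learned weights, which matches the lower amount of calculation noted above, whereas bilinear fusion costs one multiplication per pair of input elements for every output element.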
  • Based on the cross-modal interaction feature, the server determines the article category to which the article belongs.
  • the above-mentioned step 307 is similar to the above-mentioned step 205, and will not be repeated here.
  • Fig. 6 is a schematic diagram of a multimodal fusion network combined with relative position encoding provided by an embodiment of the present application.
  • As shown in Fig. 6, the multimodal fusion network includes a text encoding model 601, a picture encoding model 602, and a cross-modal interaction part 603.
  • The text encoding model 601 can be formed by cascading a BERT model 6011, obtained by fine-tuning (Finetune) a basic BERT model, with the encoder 6012 of a Transformers model. The character sequence of the text data (referred to as the text sentence sequence) is input into the BERT model 6011, which outputs a semantic vector sequence; the semantic vector sequence is input into the encoder 6012 of the Transformers model, which outputs the text semantic feature of the text data; the text semantic feature is input into a 1-dimensional convolutional (Conv1D) layer for dimension transformation and then spliced with the text position feature to obtain the text feature of the text data.
  • The picture encoding model 602 is a pre-trained MobileNet model. The picture sequence of the picture data is input into the picture encoding model 602, which outputs the picture depth feature of the picture data; the picture depth feature is input into a Conv1D layer for dimension transformation and then spliced with the picture position feature to obtain the picture feature of the picture data.
  • The cross-modal interaction part 603 includes 2 cross-modal Transformers models and 2 basic Transformers models.
  • One cross-modal Transformers model extracts the intermediate interaction features from the picture mode to the text mode; these are input into a basic Transformers model for encoding and then decoding, and the first interaction feature is output.
  • The other cross-modal Transformers model extracts the intermediate interaction features from the text mode to the picture mode; these are input into a basic Transformers model for encoding and then decoding, and the second interaction feature is output.
  • The first interaction feature and the second interaction feature are spliced to obtain the final cross-modal interaction feature between the two modalities, which is then used to predict the article category to which the article belongs (Classification).
  • If the relative position encoding method introduced in the above step 303 is adopted, the absolute position feature of each Transformers model in the cross-modal interaction part 603 needs to be changed to a relative position feature. For example, the original character embedding (embedding vector) and position embedding (position vector) are separated, and after the attention formula is expanded, the position vector of the absolute position encoding method is converted into the position vector of the relative position encoding method; that is, the relative position relationship is integrated into the self-attention layer in which any two modes interact.
  • The self-attention layer is usually expressed as: $\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$
  • Here, Attention(Q,K,V) refers to the attention coefficient calculated based on the Q (Query) matrix, the K (Key) matrix, and the V (Value) matrix; softmax() refers to the exponential normalization function; Q, K, and V refer to the Q matrix, K matrix, and V matrix of the current character; $K^{T}$ refers to the transpose of the K matrix; and $\sqrt{d_k}$ is the scaling factor.
  • In the expansion under the absolute position encoding method, E represents a text vector, U a position vector, and W a parameter matrix.
  • $E_i^{T}$ represents the transpose of the text vector of the i-th element in modality 1, and $W_q^{T}$ represents the transpose of the parameter matrix of the Q matrix.
  • $W_k$ represents the parameter matrix of the K matrix.
  • $U_j$ represents the position vector of the j-th element in modality 2.
  • $U_i^{T}$ represents the transpose of the position vector of the i-th element in modality 1.
  • In the expansion under the relative position encoding method, E represents a text vector, U a position vector, and W a parameter matrix.
  • $E_i^{T}$ represents the transpose of the text vector of the i-th element in modality 1, $W_q^{T}$ represents the transpose of the parameter matrix of the Q matrix, $W_{k,E}$ represents the parameter matrix related to the K matrix and the text vector under relative position encoding, and $E_j$ represents the text vector of the j-th element in modality 2.
  • $R_{ij}$ represents the relative position encoding vector between the i-th element in modality 1 and the j-th element in modality 2.
  • $W_{k,R}$ represents the parameter matrix related to the K matrix and the relative position encoding vector under relative position encoding.
  • $u^{T}$ and $v^{T}$ represent parameter vectors to be learned that are independent of the position of the i-th element in modality 1.
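  • The expanded expressions to which these symbol definitions refer are not reproduced in this text. As a hedged reconstruction, assuming they follow the standard four-term expansion of the attention score used by relative position encoding schemes (an assumption that this document does not confirm; the names $A^{\mathrm{abs}}$ and $A^{\mathrm{rel}}$ are also introduced here only for illustration), they would read:

```latex
% Assumed reconstruction, not taken verbatim from the source.
% Absolute position encoding: score between element i (modality 1) and element j (modality 2).
A^{\mathrm{abs}}_{i,j} = E_i^{T} W_q^{T} W_k E_j
                       + E_i^{T} W_q^{T} W_k U_j
                       + U_i^{T} W_q^{T} W_k E_j
                       + U_i^{T} W_q^{T} W_k U_j

% Relative position encoding: U_j is replaced by R_{ij}, and the query-side
% position terms U_i^{T} W_q^{T} are replaced by the learned vectors u^{T} and v^{T}.
A^{\mathrm{rel}}_{i,j} = E_i^{T} W_q^{T} W_{k,E} E_j
                       + E_i^{T} W_q^{T} W_{k,R} R_{ij}
                       + u^{T} W_{k,E} E_j
                       + v^{T} W_{k,R} R_{ij}
```

  • Under this reading, the first term is the content-to-content score, the second term is the content-to-position score, and the last two terms are global content and position biases that no longer depend on the absolute position of the i-th element.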
  • The method provided in the embodiment of the present application extracts text features and picture features from the text data and picture data of the article respectively, and uses the cross-modal interaction feature between the two to predict the article category to which the article belongs. The contributions of both the text mode and the picture mode to the article category are thus considered, instead of judging only from the text.
  • In addition, the extracted cross-modal interaction feature is not a simple splicing of the text feature and the picture feature; it reflects richer and deeper interaction information between the modalities, which helps to improve the recognition accuracy of article categories and, in the scenario of identifying high-quality articles, the mining accuracy of high-quality articles.
  • FIG. 7 is a flow chart of a data processing method provided by an embodiment of the present application. Referring to Fig. 7, this embodiment is executed by a computer device, and a server is used as an example of the computer device for illustration. For the case in which the article is divided into three modes (title, text, and picture), this embodiment describes in detail how to identify the article category of the article by fusing the three modes. This embodiment comprises the following steps:
  • the server acquires title data, text data, and image data in the article.
  • The title data and the text data (the main text of the article) may be collectively referred to as text data.
  • The foregoing step 701 is similar to the foregoing step 301, and details are not repeated here.
  • After acquiring the collective text data, the server may further split it into the title data and the text data.
  • the server extracts the title semantic feature of the title data, and fuses the title semantic feature with the title position feature to obtain the title feature of the title data.
  • The server extracts the title semantic feature based on a title encoding model, which is used to extract the title semantic feature of the title data; that is, the server inputs the title data into the title encoding model, and the title encoding model encodes the title data to extract the title semantic feature.
  • the model structure of the title encoding model includes but is not limited to: BERT model, Transformers model, ELMo model, NNLM model, etc. The embodiment of the present application does not specifically limit the model structure of the title encoding model.
  • The server can perform word segmentation on each title to obtain at least one character contained in each title, and arrange the characters of all titles in the order in which they appear in the article to form a character sequence.
  • [SEP] is added at the end of each title as a sentence separator, and [CLS] is added at the beginning of the character sequence as a classifier.
  • The sentence separator marks the boundary between adjacent titles, and the classifier represents the global semantic information of the entire character sequence.
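  • The construction of the character sequence can be sketched as follows. This sketch assumes character-level segmentation (each character of a title becomes one element); the function name and the sample titles are illustrative.

```python
def build_character_sequence(titles):
    """Prepend [CLS], then append the characters of each title followed by [SEP]."""
    sequence = ["[CLS]"]                 # classifier at the start of the whole sequence
    for title in titles:                 # titles are kept in their order of appearance
        sequence.extend(list(title))     # character-level segmentation (an assumption)
        sequence.append("[SEP]")         # sentence separator at the end of each title
    return sequence

tokens = build_character_sequence(["ab", "c"])
```

  • For the two sample titles this yields one [CLS] element, the characters of each title, and one [SEP] element after each title.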
  • The BERT model includes an embedding layer and at least one bidirectional encoding layer. Each bidirectional encoding layer is used for forward encoding and reverse encoding of the input signal, and the output of each bidirectional encoding layer is used as the input of the next bidirectional encoding layer; that is, the bidirectional encoding layers are connected in series.
  • Each bidirectional encoding layer includes two parts: an attention network and a forward fully connected layer.
  • Each hidden layer in the attention network is obtained as a weighted average of the hidden layers of the previous layer, so that each hidden layer is directly related to all hidden layers of the previous layer, and a hidden layer vector representing global information can be obtained from the input long sequence information (that is, the character sequence). The forward fully connected layer further processes the global information obtained by the attention network to enhance the learning ability of the entire BERT model.
  • The character sequence can first be input into the embedding layer of the BERT model, which embeds each character in the character sequence.
  • In other words, each character is mapped to the embedding space to obtain the embedding vector of each character; that is, a sequence of embedding vectors is obtained.
  • The embedding vector sequence is then input into the at least one bidirectional encoding layer, which performs bidirectional encoding (including forward encoding and reverse encoding) on each embedding vector in the sequence and outputs the semantic vector of each character; that is, a sequence of semantic vectors is obtained. Finally, the last bidirectional encoding layer outputs the title semantic feature of the title data.
  • Each character in the character sequence corresponds to an embedding vector in the embedding vector sequence
  • each embedding vector in the embedding vector sequence corresponds to a semantic vector in the semantic vector sequence.
  • The embedding vector sequence is forward-encoded and reverse-encoded by the bidirectional encoding layer.
  • Through forward encoding, the semantic vector of each character can fuse the relevant information of the characters that appear before it; through reverse encoding, the semantic vector of each character can fuse the relevant information of the characters that appear after it. Encoding in both directions can greatly improve the expressive power of the semantic vector of each character.
  • Take the first bidirectional encoding layer as an example, which includes the attention network and the forward fully connected layer. The embedding vector sequence is input into the attention network of the first bidirectional encoding layer and weighted by the attention network to extract the attention feature sequence of the embedding vector sequence, and the attention feature sequence is input into the forward fully connected layer.
  • The attention feature sequence is bidirectionally encoded (including forward encoding and reverse encoding) by the forward fully connected layer, and a hidden vector sequence is output.
  • The hidden vector sequence is input into the second bidirectional encoding layer, and so on.
  • The processing logic of the subsequent bidirectional encoding layers is similar to that of the first bidirectional encoding layer and is not described again here.
  • Each character can thus focus on the characters that are more closely related to it each time semantic encoding is performed, so that the finally obtained semantic vector of each character is more accurate.
  • the server may also obtain the title position feature of the title data, and the title position feature is used to characterize the sequence of positions of each character in the title data.
  • the position information of each character in the character sequence is encoded to obtain the title position feature of the title data.
  • the title semantic feature and the title position feature are concatenated to obtain the title feature of the title data.
  • When encoding the position information of each character, the server may adopt an absolute position encoding method or a relative position encoding method; the embodiment of the present application does not specifically limit the encoding method of the position information.
  • the two position encoding methods have been introduced in the previous embodiment, and will not be repeated here.
  • A 1-dimensional convolutional layer can be used to perform dimension transformation on the title semantic feature (that is, to increase or decrease its dimension), so that the dimension-transformed title semantic feature has the same dimension as the title position feature; the dimension-transformed title semantic feature and the title position feature are then spliced to obtain the title feature of the title data.
  • The 1-dimensional convolutional layer refers to a convolutional layer with a convolution kernel size of 1×1.
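  • The dimension transformation and splicing can be sketched as follows. The sketch relies on the fact that a convolution with kernel size 1 applies the same linear map independently at every position of the sequence; the weights, the bias, and the toy features are assumed values.

```python
def conv1d_kernel1(seq, weight, bias):
    """Kernel-size-1 convolution: one linear map applied at every sequence position."""
    return [[sum(w * x for w, x in zip(w_row, vec)) + b
             for w_row, b in zip(weight, bias)]
            for vec in seq]

def splice(semantic, position):
    """Concatenate the semantic and position features position by position."""
    return [s + p for s, p in zip(semantic, position)]

# Toy title semantic feature (assumed): 2 positions, dimension 3.
semantic = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]
# Toy title position feature (assumed): 2 positions, dimension 2.
position = [[0.0, 1.0], [1.0, 0.0]]

# Assumed weights mapping dimension 3 to dimension 2 so that both features match.
weight = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 1.0]]
bias = [0.0, 0.0]

projected = conv1d_kernel1(semantic, weight, bias)
title_feature = splice(projected, position)
```

  • After the transformation the semantic feature has the same dimension as the position feature, and the spliced feature has twice that dimension at each position.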
  • the server extracts the text semantic feature of the text data, and fuses the text semantic feature with the text location feature to obtain the text feature of the text data.
  • The server extracts the text semantic feature based on a text encoding model, which is used to extract the text semantic feature of the text data; that is, the server inputs the text data into the text encoding model, and the text encoding model encodes the text data to extract the text semantic feature.
  • The model structure of the text encoding model includes, but is not limited to, any one or a combination of at least two of the following: BERT model, Transformers model, ELMo model, NNLM model, etc.; the embodiment of the present application does not specifically limit the model structure of the text encoding model.
  • The text encoding model can be formed by cascading a BERT model with the encoder of a Transformers model.
  • A text encoding model of this structure processes the text data in a way similar to the text encoding model in step 302 above, which is not repeated here.
  • the server may also acquire text position features of the text data, where the text position features are used to characterize the sequence of positions of characters in the text data.
  • the position information of each character in the character sequence is encoded to obtain the text position feature of the text data.
  • the semantic features of the text and the position features of the text are concatenated to obtain the text features of the text data.
  • When encoding the position information of each character, the server may adopt an absolute position encoding method or a relative position encoding method; the embodiment of the present application does not specifically limit the encoding method of the position information.
  • the two position encoding methods have been introduced in the previous embodiment, and will not be repeated here.
  • A 1-dimensional convolutional layer can be used to perform dimension transformation on the text semantic feature (that is, to increase or decrease its dimension), so that the dimension-transformed text semantic feature has the same dimension as the text position feature; the dimension-transformed text semantic feature and the text position feature are then spliced to obtain the text feature of the text data.
  • The 1-dimensional convolutional layer refers to a convolutional layer with a convolution kernel size of 1×1.
  • Element-wise addition, element-wise multiplication, and bilinear fusion can also be used to fuse the text semantic feature and the text position feature.
  • The embodiment of the present application does not specifically limit the fusion method.
  • Steps 702-703 above show a possible implementation in which the server extracts the text semantic feature of the text data and fuses it with the text position feature to obtain the text feature of the text data. By dividing the text data into title data and text data, more and richer feature information can be extracted.
  • the server extracts the picture depth feature of the picture data, and fuses the picture depth feature with the picture position feature to obtain the picture feature of the picture data.
  • the foregoing step 704 is similar to the foregoing step 303, and details are not described here.
  • The server determines the title interaction feature based on the partial features of the text feature and of the picture feature that are each associated with the title feature; the title interaction feature represents the title feature after the text feature and the picture feature are fused in.
  • The server determines the first title interaction feature based on the partial features of the text feature that are associated with the title feature; that is, the title mode is the main mode and receives the auxiliary information of the text mode.
  • The server acquires the partial features of the text feature that are associated with the title feature, and uses a cross-modal interaction model to process the title feature and these partial features to obtain the first title interaction feature.
  • The cross-modal interaction model includes but is not limited to a Transformers model or a variant of the Transformers model. For example, the cross-modal interaction model can be a cross-modal Transformers model in which the main mode α is the title mode and the auxiliary mode β is the text mode; the model structure and processing logic are similar to those of step 304 above and are not repeated here.
  • The server determines the second title interaction feature based on the partial features of the picture feature that are associated with the title feature; that is, the title mode is the main mode and receives the auxiliary information of the picture mode.
  • The server acquires the partial features of the picture feature that are associated with the title feature, and uses a cross-modal interaction model to process the title feature and these partial features to obtain the second title interaction feature.
  • The cross-modal interaction model includes but is not limited to a Transformers model or a variant of the Transformers model. For example, the cross-modal interaction model can be a cross-modal Transformers model in which the main mode α is the title mode and the auxiliary mode β is the picture mode; the model structure and processing logic are similar to those of step 304 above and are not repeated here.
  • The server concatenates the first title interaction feature with the second title interaction feature to obtain a third title interaction feature, which reduces the computational complexity of fusing the two. Optionally, fusion methods such as element-wise addition, element-wise multiplication, and bilinear fusion may also be adopted, which is not specifically limited in this embodiment of the present application.
  • the server encodes and decodes the third title interaction feature to obtain the title interaction feature.
  • The server inputs the third title interaction feature into the Transformers model, which includes N cascaded encoders and N cascaded decoders; it calls the N cascaded encoders to encode the third title interaction feature to obtain the intermediate title interaction feature, and inputs the intermediate title interaction feature into the N cascaded decoders for decoding to obtain the title interaction feature.
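  • The data flow that produces the title interaction feature can be sketched as follows. Only the order of operations is illustrated; `stub_cross_modal` and `stub_encode_decode` are hypothetical stand-ins for the cross-modal Transformers model and the basic Transformers model, and the toy features are assumed values.

```python
def mean_pool(seq):
    """Average a sequence of feature vectors into one vector."""
    return [sum(col) / len(seq) for col in zip(*seq)]

def stub_cross_modal(main, aux):
    """Hypothetical stand-in for a cross-modal Transformers model:
    adds the pooled auxiliary feature to every element of the main mode."""
    pooled = mean_pool(aux)
    return [[m + p for m, p in zip(vec, pooled)] for vec in main]

def stub_encode_decode(seq):
    """Hypothetical stand-in for the N-encoder/N-decoder Transformers model (identity here)."""
    return [vec[:] for vec in seq]

def title_interaction_feature(title, text, picture,
                              cross_modal=stub_cross_modal,
                              encode_decode=stub_encode_decode):
    first = cross_modal(title, text)       # title as main mode, text as auxiliary mode
    second = cross_modal(title, picture)   # title as main mode, picture as auxiliary mode
    third = [a + b for a, b in zip(first, second)]  # concatenate per title element
    return encode_decode(third)            # encode, then decode

title = [[1.0, 0.0], [0.0, 1.0]]
text = [[2.0, 2.0], [4.0, 0.0]]
picture = [[1.0, 1.0]]
result = title_interaction_feature(title, text, picture)
```

  • The same composition would apply with the text or the picture as the main mode, by swapping which feature is passed as the main argument and which two are passed as auxiliary inputs.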
  • Each of the N cascaded encoders includes a multi-head attention layer and a feed-forward neural network layer.
  • The multi-head attention layer comprehensively extracts, from multiple representation subspaces, the association relationships between the characters in the title data, and the feed-forward neural network layer applies a full connection to the feature vectors output by the multi-head attention layer.
  • A residual structure is set after both the multi-head attention layer and the feed-forward neural network layer; that is, the input and output of the current layer are residually connected (that is, concatenated) and normalized before being input to the next layer.
  • the input vector is encoded by N cascaded encoders, and the intermediate title interaction feature is output by the last encoder.
  • Each of the N cascaded decoders includes a masked multi-head attention layer, a fusion multi-head attention layer and a feed-forward neural network layer.
  • The masked multi-head attention layer is similar to the multi-head attention layer, but it attends only to the translation results before the current moment, so the translation results after the current moment need to be masked (occluded). The fusion multi-head attention layer is also similar to the multi-head attention layer, but it additionally takes as input the output of the feed-forward neural network layer of the encoder with the corresponding serial number (that is, the result after residual connection and normalization).
  • This design lets the decoder attend to the information encoded by the encoder.
  • In other words, the decoder predicts the interaction feature of the next moment by looking at the output of the encoder and applying self-attention to its own output.
  • The feed-forward neural network layer of the decoder is similar to that of the encoder and is not described again here.
  • A residual structure is likewise set after the masked multi-head attention layer, the fusion multi-head attention layer, and the feed-forward neural network layer of the decoder; that is, the input and output of the current layer are residually connected (that is, spliced) and normalized before being input to the next layer.
  • the number of cascaded encoders needs to be consistent with the number of cascaded decoders.
  • the intermediate title interaction feature can be decoded by N cascaded decoders, and the final title interaction feature is output by the last decoder.
  • The server determines the text interaction feature based on the partial features of the title feature and of the picture feature that are each associated with the text feature; the text interaction feature represents the text feature after the title feature and the picture feature are fused in.
  • steps 705-706 show possible implementations of how to obtain the first interaction feature.
  • The server determines the first text interaction feature based on the partial features of the title feature that are associated with the text feature; that is, the text mode is the main mode and receives the auxiliary information of the title mode.
  • The server acquires the partial features of the title feature that are associated with the text feature, and uses a cross-modal interaction model to process the text feature and these partial features to obtain the first text interaction feature.
  • The cross-modal interaction model includes but is not limited to a Transformers model or a variant of the Transformers model. For example, the cross-modal interaction model can be a cross-modal Transformers model in which the main mode α is the text mode and the auxiliary mode β is the title mode; the model structure and processing logic are similar to those of step 304 above and are not repeated here.
  • The server determines the second text interaction feature based on the partial features of the picture feature that are associated with the text feature; that is, the text mode is the main mode and receives the auxiliary information of the picture mode.
  • The server acquires the partial features of the picture feature that are associated with the text feature, and uses a cross-modal interaction model to process the text feature and these partial features to obtain the second text interaction feature.
  • The cross-modal interaction model includes but is not limited to a Transformers model or a variant of the Transformers model. For example, the cross-modal interaction model can be a cross-modal Transformers model in which the main mode α is the text mode and the auxiliary mode β is the picture mode; the model structure and processing logic are similar to those of step 304 above and are not repeated here.
  • The server splices the first text interaction feature and the second text interaction feature to obtain the third text interaction feature, which reduces the computational complexity of fusing the two. Optionally, fusion methods such as element-wise addition, element-wise multiplication, and bilinear fusion may also be adopted, which is not specifically limited in this embodiment of the present application.
  • the server encodes and decodes the third text interaction feature to obtain the text interaction feature.
  • the server inputs the third text interaction feature into the Transformers model, encodes it through N cascaded encoders in the Transformers model to obtain an intermediate text interaction feature, and inputs the intermediate text interaction feature into N cascaded decoders for decoding to obtain the text interaction feature.
  • the server determines a second interaction feature based on the partial features of the title feature and of the text feature that are respectively associated with the picture feature; the second interaction feature is used to represent the picture feature fused with the title feature and the text feature.
  • the server determines the first picture interaction feature based on some features associated with the picture feature in the title feature, that is to say, the picture mode is the main mode, and the auxiliary information of the title mode is received.
  • the server acquires part of the features associated with the image feature in the title feature, and processes the image feature and the part of the feature using a cross-modal interaction model to obtain the first image interaction feature.
  • the cross-modal interaction model includes, but is not limited to, a Transformers model or a variant of the Transformers model. For example, the cross-modal interaction model can be a cross-modal Transformers model in which the main modality α is the picture modality and the auxiliary modality β is the title modality; the model structure and processing logic are similar to step 304 above and are not repeated here.
  • the server determines the second picture interaction feature based on some features associated with the picture feature in the text feature, that is to say, the picture mode is the main mode, and the auxiliary information of the text mode is received.
  • the server acquires the partial features of the text feature that are associated with the picture feature, and processes the picture feature and these partial features with a cross-modal interaction model to obtain the second picture interaction feature.
  • the cross-modal interaction model includes, but is not limited to, a Transformers model or a variant of the Transformers model. For example, the cross-modal interaction model can be a cross-modal Transformers model in which the main modality α is the picture modality and the auxiliary modality β is the text (main body) modality; the model structure and processing logic are similar to step 304 above and are not repeated here.
  • the server splices (concatenates) the first picture interaction feature and the second picture interaction feature to obtain the third picture interaction feature, which reduces the computational complexity of fusing the first picture interaction feature and the second picture interaction feature.
  • optionally, fusion methods such as element-wise addition, element-wise multiplication, and bilinear fusion may also be adopted, which is not specifically limited in this embodiment of the present application.
  • the server encodes and decodes the third picture interaction feature to obtain the second interaction feature.
  • the server inputs the third picture interaction feature into the Transformers model, encodes it through N cascaded encoders in the Transformers model to obtain an intermediate picture interaction feature, and inputs the intermediate picture interaction feature into N cascaded decoders for decoding to obtain the second interaction feature.
  • the server fuses the title interaction feature, the text interaction feature, and the second interaction feature to obtain a cross-modal interaction feature.
  • the server concatenates the title interaction feature, the text interaction feature and the second interaction feature to obtain the final cross-modal interaction feature between the three modalities, thereby reducing the amount of calculation during feature fusion.
  • alternatively, the server can fuse the title interaction feature, the text interaction feature, and the second interaction feature by element-wise addition, element-wise multiplication, or bilinear fusion, so that the features are fused more fully; the embodiment of the present application does not specifically limit the feature fusion manner.
  • a possible implementation for the server to obtain the cross-modal interaction feature is thus provided: by dividing the text data into title data and text (main body) data, the original two-modal fusion is extended to three-modal fusion. This makes full use of the sequence-level interaction information between modalities by applying directional cross-modal attention weighting to the pairwise combinations of the title, text, and picture modalities (6 directed combinations in total). Each modality in turn serves as the main modality and receives auxiliary information from the other two modalities, which greatly improves the expressive power of the resulting cross-modal interaction feature and, in turn, the accuracy of predictions based on it.
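The count of 6 directed combinations follows from pairing each of the three modalities, as the main modality, with each of the other two as the auxiliary modality. A short sketch makes the enumeration explicit; the modality labels and function names are illustrative assumptions.

```python
from itertools import permutations
from typing import Dict, List, Sequence, Tuple

MODALITIES = ("title", "text", "picture")


def directed_pairs(modalities: Sequence[str]) -> List[Tuple[str, str]]:
    """Return (auxiliary, main) pairs: every modality receives from every other one."""
    return [(aux, main) for main, aux in permutations(modalities, 2)]


def group_by_main(pairs: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group the auxiliary modalities under the main modality they feed."""
    grouped: Dict[str, List[str]] = {}
    for aux, main in pairs:
        grouped.setdefault(main, []).append(aux)
    return grouped


pairs = directed_pairs(MODALITIES)
by_main = group_by_main(pairs)
```

Each key of `by_main` corresponds to one main modality, and its two entries correspond to the first and second interaction features that are spliced into the third interaction feature for that modality.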
  • the server determines the article category to which the article belongs based on the cross-modal interaction feature.
  • the above-mentioned step 709 is similar to the above-mentioned step 205, and will not be repeated here.
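  • Fig. 8 is a schematic diagram of the principle of a multimodal fusion network provided by an embodiment of the present application. As shown in Fig. 8, the multimodal fusion network includes a title encoding model 801, a text encoding model 802, a picture coding model 803, and a cross-modal interaction part 804.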
  • the title encoding model 801 is a BERT model obtained by fine-tuning the basic BERT model (Finetune).
  • the character sequence of the title data (referred to as the title sequence) is input into the title encoding model 801, and the title semantic features of the title data are output.
  • the semantic features are input into a 1-dimensional convolutional layer (Conv1D) for dimension transformation, and then spliced with the title position features to obtain the title features of the title data.
  • the text encoding model 802 is formed by cascading the fine-tuned BERT model 8021 and the encoder 8022 of the Transformers model.
  • the character sequence of the text data (referred to as the text sentence sequence) is input into the BERT model 8021, which outputs a semantic vector sequence; the semantic vector sequence is input into the encoder 8022 of the Transformers model, which outputs the text semantic features of the text data; the text semantic features are input into a Conv1D layer for dimension transformation and then spliced with the text position features to obtain the text features of the text data.
  • the picture coding model 803 is a pre-trained MobileNet model. The picture sequence of the picture data is input into the picture coding model 803, which outputs the picture depth features of the picture data; the picture depth features are input into a Conv1D layer for dimension transformation and then spliced with the picture position features to obtain the picture features of the picture data.
  • the cross-modal interaction part 804 includes 6 cross-modal Transformers models and 3 basic Transformers models. With the title modality as the main modality, cross-modal Transformers models extract the first title interaction feature (text modality → title modality) and the second title interaction feature (picture modality → title modality); the first title interaction feature is spliced with the second title interaction feature to obtain the third title interaction feature, which is input into a Transformers model to be encoded and then decoded, yielding the title interaction feature. In addition, with the text modality as the main modality, cross-modal Transformers models extract the first text interaction feature (title modality → text modality) and the second text interaction feature (picture modality → text modality).
  • the first text interaction feature and the second text interaction feature are spliced to obtain the third text interaction feature, which is input into a Transformers model to be encoded and then decoded, yielding the text interaction feature.
  • with the picture modality as the main modality, cross-modal Transformers models extract the first picture interaction feature (title modality → picture modality) and the second picture interaction feature (text modality → picture modality).
  • the first picture interaction feature and the second picture interaction feature are spliced to obtain the third picture interaction feature, which is input into a Transformers model to be encoded and then decoded, yielding the second interaction feature.
  • the title interaction feature, the text interaction feature, and the second interaction feature are spliced to obtain the final cross-modal interaction feature among the three modalities, and the cross-modal interaction feature is then used to predict the article category to which the article belongs (classification).
  • a relative position encoding manner may also be introduced based on a manner similar to that in the foregoing embodiment, which will not be described in detail here.
  • the above multimodal fusion network constructs a cross-modal interaction method for high-quality image-text recognition in the case where the modalities are not aligned. In the cross-modal interaction part for the three modalities, the sequence-level interaction information between modalities can be fully utilized.
  • the self-attention-based Transformers model is then used to continue modeling in combination with the context, and finally the three sets of features (the title interaction feature, the text interaction feature, and the second interaction feature) are spliced for prediction. Longitudinal comparison experiments show that the model performs best in the three-way combination of title, text, and picture, that is, the interaction information between any two modalities significantly enhances the model.
  • the above multimodal fusion network combined with the relative position encoding method can be applied to the scenario of recognizing high-quality image-text content.
  • in this scenario, the modal interaction between adjacent pictures and text is very important; introducing the relative position encoding method enhances the learning of the relative positional relationship between the text sequence and the picture sequence, thereby improving the recognition accuracy of the overall model.
  • the overall matching between pictures and text is also crucial.
  • the above multimodal fusion network combined with the relative position encoding method completes the construction of the multimodal module for the high-quality image-text recognition scenario.
  • the model's evaluation accuracy reached 95%. By contrast, traditional supervised means of identifying high-quality image-text content, such as judging content quality only from the perspective of text or relying on text embeddings, consider only a single dimension and cannot learn the modal interaction information between adjacent text and pictures, so their overall accuracy is lower than 95%. Therefore, the method provided in the embodiment of the present application can greatly improve the recognition accuracy for article categories.
  • the coverage rate of high-quality image-text content reached 17%.
  • high-quality content is recommended to users first, which has achieved good business results on the business side compared with historical application versions.
  • content quality is scored for all image-text content, which is then released and distributed to the terminal side; the terminal side performs hierarchical recommendation weighting according to the content quality score, for example, increasing the recommendation weight of identified high-quality content and reducing the recommendation weight of low-quality content.
  • this recommendation method can effectively improve the user's reading experience, and is an innovation in the recommendation algorithm based on specific business scenarios.
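Hierarchical recommendation weighting by content quality score could look like the following sketch. The thresholds, the boost and penalty factors, and the tuple layout are all hypothetical values chosen for illustration; the source only states that high-quality content is weighted up and low-quality content is weighted down.

```python
from typing import List, Tuple

# (article_id, base_recommendation_score, content_quality_score in [0, 1])
Item = Tuple[str, float, float]


def recommendation_weight(quality: float,
                          high: float = 0.8,    # hypothetical "high quality" threshold
                          low: float = 0.3,     # hypothetical "low quality" threshold
                          boost: float = 1.5,   # hypothetical up-weighting factor
                          penalty: float = 0.5  # hypothetical down-weighting factor
                          ) -> float:
    """Map a content quality score to a tiered recommendation weight."""
    if quality >= high:
        return boost
    if quality < low:
        return penalty
    return 1.0


def rerank(items: List[Item]) -> List[Item]:
    """Sort items by base score multiplied by the quality-tier weight, best first."""
    return sorted(items, key=lambda it: it[1] * recommendation_weight(it[2]), reverse=True)


candidates = [("a", 1.0, 0.9), ("b", 1.2, 0.5), ("c", 1.4, 0.1)]
ranked_ids = [item[0] for item in rerank(candidates)]
```

In this toy input the article with the lowest base score ranks first because its quality tier boosts it, while the article with the highest base score drops to last place.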
  • the method provided in the embodiment of the present application extracts text features and picture features from the text data and picture data of the article respectively, and uses the cross-modal interaction feature between the two to predict the article category to which the article belongs. The method considers the respective contributions of the text modality and the picture modality to the article category, instead of judging only from the text perspective.
  • in addition, the extracted cross-modal interaction feature is not a simple splicing of text features and picture features; it reflects richer and deeper interaction information between modalities, which helps to improve the recognition accuracy of article categories and, in turn, the mining accuracy of high-quality articles in the scenario of identifying high-quality articles.
  • Fig. 9 is a schematic structural diagram of a data processing device provided by an embodiment of the present application, please refer to Fig. 9, the device includes:
  • the first obtaining module 901 is used to obtain text features and picture features of the article, the text features are used to represent the text data in the article, and the picture features are used to represent the picture data in the article;
  • the second acquisition module 902 is configured to determine a first interaction feature for the text feature based on some features associated with the text feature in the picture feature, and the first interaction feature is used to characterize the text feature fused with the picture feature;
  • the third acquisition module 903 is configured to determine a second interaction feature for the picture feature based on some features associated with the picture feature in the text feature, and the second interaction feature is used to characterize the picture feature fused with the text feature;
  • a fusion module 904 configured to fuse the first interaction feature with the second interaction feature to obtain a cross-modal interaction feature
  • the determination module 905 is configured to determine the article category to which the target article belongs based on the cross-modal interaction feature.
  • the device provided in the embodiment of the present application extracts text features and picture features from the text data and picture data of the article respectively, and uses the cross-modal interaction feature between the two to predict the article category to which the article belongs.
  • the device considers the respective contributions of the text modality and the picture modality to the article category, instead of judging only from the text perspective.
  • in addition, the extracted cross-modal interaction feature is not a simple splicing of text features and picture features; it reflects richer and deeper interaction information between modalities, which helps to improve the recognition accuracy of article categories and, in turn, the mining accuracy of high-quality articles in the scenario of identifying high-quality articles.
  • the first acquisition module 901 includes:
  • the first extraction and fusion unit is used to extract the text semantic features of the text data, and fuse the text semantic features and text position features to obtain the text features;
  • the second extraction and fusion unit is used to extract the picture depth feature of the picture data, and fuse the picture depth feature with the picture position feature to obtain the picture feature.
  • the text data includes title data and main-body text data
  • the text features include a title feature and a main-body text feature (referred to simply as the text feature in the three-modality case)
  • the first extraction and fusion unit is used for:
  • fusing the title semantic feature with the title position feature to obtain the title feature;
  • fusing the text semantic feature with the text position feature to obtain the text feature.
  • the first interaction feature includes a title interaction feature and a text interaction feature.
  • the second acquisition module 902 includes:
  • the first acquisition unit is used to determine, for the title feature, the title interaction feature based on the partial features of the text feature and of the picture feature that are respectively associated with the title feature; the title interaction feature is used to represent the title feature fused with the text feature and the picture feature;
  • the second acquisition unit is used to determine, for the text feature, the text interaction feature based on the partial features of the title feature and of the picture feature that are respectively associated with the text feature; the text interaction feature is used to represent the text feature fused with the title feature and the picture feature.
  • the first acquisition unit is used for:
  • the third title interaction feature is encoded and decoded to obtain the title interaction feature.
  • the second acquisition unit is used for:
  • the third text interaction feature is encoded and decoded to obtain the text interaction feature.
  • the third obtaining module 903 includes:
  • the third acquisition unit is configured to determine the second interaction feature for the picture feature based on the title feature and the partial features of the text feature that are respectively associated with the picture feature.
  • the third acquisition unit is used for:
  • the third picture interaction feature is encoded and decoded to obtain the second interaction feature.
  • both the text position feature and the picture position feature are relative position features between the text data and the picture data, and the relative position feature is used to represent the order of, and the distance between, the text data and the picture data.
  • the method for determining the relative position feature includes:
  • a relative position encoding matrix is constructed, and any element in the relative position encoding matrix is used to represent the relative position information between the text corresponding to the column to which the element belongs and the picture corresponding to the row to which the element belongs;
  • a relative position feature between any text in the plurality of texts and any picture in the plurality of pictures is determined.
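A relative position encoding matrix with one row per picture and one column per text can be sketched as follows. The encoding used here, a signed offset between block indices in the article, is an assumption for illustration: the sign captures the order and the magnitude captures the distance, which matches what the relative position feature is said to represent, but the exact construction is defined in an earlier embodiment that is not reproduced in this section.

```python
from typing import List


def relative_position_matrix(text_positions: List[int],
                             picture_positions: List[int]) -> List[List[int]]:
    """Rows correspond to pictures, columns to texts.

    Each element is the (assumed) signed offset text_position - picture_position:
    negative means the text comes before the picture, positive means after,
    and the absolute value is the distance in blocks.
    """
    return [[t - p for t in text_positions] for p in picture_positions]


# Hypothetical article layout: text, picture, text, text, picture (block indices 0..4).
text_positions = [0, 2, 3]
picture_positions = [1, 4]
matrix = relative_position_matrix(text_positions, picture_positions)
```

Looking up `matrix[row][column]` then gives the relative position information between the picture of that row and the text of that column.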
  • the determination module 905 is used to:
  • the cross-modal interaction feature is fully connected to obtain the fully connected feature
  • Exponential normalization is performed on the fully connected feature to obtain the probability prediction result of the article; the probability prediction result includes multiple prediction probabilities, and the multiple prediction probabilities correspond to multiple categories one by one;
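The two operations above, a fully connected layer followed by exponential normalization (softmax), can be sketched in plain Python. The weights, bias, and input vector are hypothetical toy values; in practice they would be learned parameters and the cross-modal interaction feature.

```python
import math
from typing import List

Vector = List[float]


def fully_connected(x: Vector, weights: List[Vector], bias: Vector) -> Vector:
    """One output per category: dot product of x with a weight row, plus a bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(logits: Vector) -> Vector:
    """Exponential normalization; subtracting the maximum avoids overflow."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical 2-dimensional feature and 2 categories.
feature = [1.0, 2.0]
weights = [[1.0, 0.0], [0.0, 1.0]]
bias = [0.0, 0.0]
logits = fully_connected(feature, weights, bias)
probabilities = softmax(logits)
predicted_category = probabilities.index(max(probabilities))
```

The prediction probabilities sum to 1 and correspond one-to-one with the categories, so taking the index of the largest probability is one simple way to pick a category.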
  • when the data processing device provided by the above embodiments processes data, the division into the above functional modules is used only as an example for illustration. In practical applications, the above functions can be allocated to different functional modules as needed, that is, the internal structure of the computer device is divided into different functional modules to complete all or part of the functions described above.
  • the data processing device and the data processing method embodiment provided by the above embodiment belong to the same idea, and the specific implementation process thereof is detailed in the data processing method embodiment, and will not be repeated here.
  • FIG. 10 is a schematic structural diagram of a computer device provided by an embodiment of the present application. Referring to FIG. 10, the description takes the case where the computer device is a terminal 1000 as an example; in this case, the terminal 1000 can independently complete the process of identifying the article category of an article.
  • the device types of the terminal 1000 include: smartphones, tablet computers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptops, or desktop computers.
  • the terminal 1000 may also be called user equipment, portable terminal, laptop terminal, desktop terminal and other names.
  • the terminal 1000 includes: a processor 1001 and a memory 1002 .
  • the processor 1001 includes one or more processing cores, such as a 4-core processor, an 8-core processor, and the like.
  • the processor 1001 adopts at least one of DSP (Digital Signal Processing, digital signal processing), FPGA (Field-Programmable Gate Array, field programmable gate array), PLA (Programmable Logic Array, programmable logic array) implemented in the form of hardware.
  • the processor 1001 includes a main processor and a coprocessor, and the main processor is a processor for processing data in a wake-up state, also called a CPU (Central Processing Unit, central processing unit);
  • a coprocessor is a low-power processor for processing data in a standby state.
  • the processor 1001 is integrated with a GPU (Graphics Processing Unit, image processor), and the GPU is used for rendering and drawing the content that needs to be displayed on the display screen.
  • the processor 1001 further includes an AI (Artificial Intelligence, artificial intelligence) processor, where the AI processor is configured to process computing operations related to machine learning.
  • memory 1002 includes one or more computer-readable storage media, which are optionally non-transitory.
  • the memory 1002 also includes a high-speed random access memory and a non-volatile memory, such as one or more disk storage devices and flash memory storage devices.
  • the non-transitory computer-readable storage medium in the memory 1002 is used to store at least one program code, and the at least one program code is executed by the processor 1001 to implement the data processing methods provided in the various embodiments of the present application.
  • the terminal 1000 may optionally further include: a peripheral device interface 1003 and at least one peripheral device.
  • the processor 1001, the memory 1002, and the peripheral device interface 1003 can be connected through buses or signal lines.
  • Each peripheral device can be connected to the peripheral device interface 1003 through a bus, a signal line or a circuit board.
  • the peripheral device includes: at least one of a radio frequency circuit 1004 , a display screen 1005 , a camera component 1006 , an audio circuit 1007 , a positioning component 1008 and a power supply 1009 .
  • the terminal 1000 further includes one or more sensors 1010 .
  • the one or more sensors 1010 include, but are not limited to: an acceleration sensor 1011 , a gyroscope sensor 1012 , a pressure sensor 1013 , a fingerprint sensor 1014 , an optical sensor 1015 and a proximity sensor 1016 .
  • the structure shown in FIG. 10 does not constitute a limitation on the terminal 1000, which may include more or fewer components than shown, combine certain components, or adopt a different component arrangement.
  • Fig. 11 is a schematic structural diagram of a computer device provided by an embodiment of the present application.
  • the computer device 1100 may have relatively large differences due to different configurations or performances.
  • the computer device 1100 includes one or more processors (Central Processing Units, CPU) 1101 and one or more memories 1102, wherein at least one computer program is stored in the memory 1102, and the at least one computer program is loaded and executed by the one or more processors 1101 to implement the data processing methods provided in the above embodiments.
  • the computer device 1100 also has components such as a wired or wireless network interface, a keyboard, and an input and output interface for input and output.
  • the computer device 1100 also includes other components for implementing device functions, which will not be described in detail here.
  • a computer-readable storage medium is also provided, such as a memory including at least one computer program;
  • the at least one computer program can be executed by a processor in the terminal to complete the data processing methods in the above embodiments.
  • the computer-readable storage medium includes ROM (Read-Only Memory, read-only memory), RAM (Random-Access Memory, random-access memory), CD-ROM (Compact Disc Read-Only Memory, read-only disc), Magnetic tapes, floppy disks, and optical data storage devices, etc.
  • a computer program product or computer program comprising one or more pieces of program code stored in a computer readable storage medium.
  • one or more processors of the computer device can read the one or more pieces of program code from the computer-readable storage medium and execute them, so that the computer device can perform the data processing method in the above embodiments.
  • the steps for implementing the above embodiments can be completed by hardware, and can also be completed by instructing related hardware through a program.
  • the program is stored in a computer-readable storage medium.
  • the storage medium mentioned above is a read-only memory, a magnetic disk or an optical disk, and the like.


Abstract

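The present application discloses a data processing method and apparatus, a computer device, and a storage medium, and belongs to the field of computer technology. For the text data and picture data of an article, the present application extracts text features and picture features respectively, and uses the cross-modal interaction feature between the two to predict the article category to which the article belongs. This takes into account the respective contributions of the text modality and the picture modality to the article category, rather than judging only from the text perspective. In addition, the extracted cross-modal interaction feature is not a simple concatenation of the text features and the picture features, and can reflect richer and deeper interaction information between modalities, which greatly improves the recognition accuracy for article categories and, in the scenario of identifying high-quality articles, improves the mining accuracy for high-quality articles.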

Description

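Data processing method and apparatus, computer device, and storage medium
This application claims priority to Chinese Patent Application No. 2021111061865, entitled "Data processing method and apparatus, computer device, and storage medium", filed with the China Patent Office on September 22, 2021, the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of computer technology, and in particular to data processing technology.
Background
With the development of computer technology, the task of discovering high-quality articles has gradually become a research hotspot. Through this task, high-quality articles can be discovered and pushed to users, improving users' reading experience. At present, when performing this task, the content quality of an article is usually judged from the text perspective (without considering the contribution of pictures to content quality), or the respective embedding features of the text and the pictures are concatenated (concat) and whether the article is a high-quality article is determined based on the concatenated features.
In the above process, whether content quality is judged from the text perspective or from the concatenated text and picture features, the mining accuracy for high-quality articles still needs to be improved.
Summary
Embodiments of this application provide a data processing method and apparatus, a computer device, and a storage medium, which can improve the recognition accuracy for article categories and thereby the mining accuracy for high-quality articles. The technical solutions are as follows: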
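In one aspect, a data processing method is provided, performed by a computer device, the method including:
obtaining a text feature and a picture feature of an article, the text feature being used to represent the text data in the article, and the picture feature being used to represent the picture data in the article;
for the text feature, determining a first interaction feature based on the partial features of the picture feature that are associated with the text feature, the first interaction feature being used to represent the text feature fused with the picture feature;
for the picture feature, determining a second interaction feature based on the partial features of the text feature that are associated with the picture feature, the second interaction feature being used to represent the picture feature fused with the text feature;
fusing the first interaction feature with the second interaction feature to obtain a cross-modal interaction feature;
determining, based on the cross-modal interaction feature, the article category to which the article belongs.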
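In one aspect, a data processing apparatus is provided, the apparatus including:
a first obtaining module, configured to obtain a text feature and a picture feature of an article, the text feature being used to represent the text data in the article, and the picture feature being used to represent the picture data in the article;
a second obtaining module, configured to determine, for the text feature, a first interaction feature based on the partial features of the picture feature that are associated with the text feature, the first interaction feature being used to represent the text feature fused with the picture feature;
a third obtaining module, configured to determine, for the picture feature, a second interaction feature based on the partial features of the text feature that are associated with the picture feature, the second interaction feature being used to represent the picture feature fused with the text feature;
a fusion module, configured to fuse the first interaction feature with the second interaction feature to obtain a cross-modal interaction feature;
a determining module, configured to determine, based on the cross-modal interaction feature, the article category to which the article belongs.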
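In one aspect, a computer device is provided, the computer device including one or more processors and one or more memories, the one or more memories storing at least one computer program, the at least one computer program being loaded and executed by the one or more processors to implement the data processing method of any of the possible implementations described above.
In one aspect, a storage medium is provided, the storage medium storing at least one computer program, the at least one computer program being loaded and executed by a processor to implement the data processing method of any of the possible implementations described above.
In one aspect, a computer program product or computer program is provided, the computer program product or computer program including one or more pieces of program code stored in a computer-readable storage medium. One or more processors of a computer device can read the one or more pieces of program code from the computer-readable storage medium and execute them, so that the computer device can perform the data processing method of any of the possible implementations described above.
The beneficial effects of the technical solutions provided by the embodiments of this application include at least the following:
For the text data and picture data of an article, text features and picture features are extracted respectively, and the cross-modal interaction feature between the two is used to predict the article category to which the article belongs. The method takes into account the respective contributions of the text modality and the picture modality to the article category, rather than judging only from the text perspective. In addition, the extracted cross-modal interaction feature is not a simple concatenation of the text features and the picture features; it can reflect richer and deeper interaction information between modalities, which helps to improve the recognition accuracy for article categories and, in the scenario of identifying high-quality articles, improves the mining accuracy for high-quality articles.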
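Brief Description of the Drawings
Fig. 1 is a schematic diagram of an implementation environment of a data processing method provided by an embodiment of this application;
Fig. 2 is a flowchart of a data processing method provided by an embodiment of this application;
Fig. 3 is a flowchart of a data processing method provided by an embodiment of this application;
Fig. 4 is a schematic diagram of the principle of extracting position information provided by an embodiment of this application;
Fig. 5 is a schematic diagram of the principle of a cross-modal interaction model provided by an embodiment of this application;
Fig. 6 is a schematic diagram of the principle of a multimodal fusion network combined with relative position encoding provided by an embodiment of this application;
Fig. 7 is a flowchart of a data processing method provided by an embodiment of this application;
Fig. 8 is a schematic diagram of the principle of a multimodal fusion network provided by an embodiment of this application;
Fig. 9 is a schematic structural diagram of a data processing apparatus provided by an embodiment of this application;
Fig. 10 is a schematic structural diagram of a computer device provided by an embodiment of this application;
Fig. 11 is a schematic structural diagram of a computer device provided by an embodiment of this application.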
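Detailed Description
To make the objectives, technical solutions, and advantages of this application clearer, the implementations of this application are described in further detail below with reference to the accompanying drawings.
In this application, the terms "first", "second", and the like are used to distinguish between identical or similar items whose roles and functions are substantially the same. It should be understood that "first", "second", and "n-th" have no logical or temporal dependency on one another, and do not limit quantity or execution order.
In this application, the term "at least one" means one or more, and "multiple" means two or more; for example, multiple first positions means two or more first positions.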
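The solutions provided by the embodiments of this application involve technologies such as machine learning in artificial intelligence, and in particular Multi-Modal Machine Learning (MMML). The terminology of multimodal machine learning is explained below:
Modality: each source or form of information can be called a modality. For example, humans have touch, hearing, sight, and smell; the media of information include speech, video, text, and so on; and there are many kinds of sensors, such as radar, infrared, and accelerometers. Each of the above can be called a modality. A modality can also be defined very broadly: two different languages can be treated as two modalities, and even data sets collected under two different conditions can be regarded as two modalities. In the embodiments of this application, an article is optionally divided into two modalities, text and picture, or into three modalities, title, body text, and picture.
Multimodal machine learning: multimodal learning for short, which aims to process and understand multi-source modal information through machine learning. A single modality refers to one modality alone, while multimodality refers to two or more modalities combined in various forms. Popular research directions at present are multimodal learning across images, video, audio, and semantics. Overall, multimodal learning is divided into the following research directions: multimodal representation learning, modality translation, alignment, multimodal fusion, co-learning, and so on. Single-modal representation learning is responsible for representing information as numerical vectors that a computer can process, or for further abstracting it into higher-level feature vectors, whereas multimodal representation learning exploits the complementarity between modalities and removes the redundancy between them, so as to learn better feature representations.
Multimodal Fusion: a research direction of multimodal learning. Multimodal fusion is responsible for combining information from multiple modalities to make a target prediction (classification or regression). It is one of the earliest research directions of MMML and currently the most widely applied. Multimodal fusion also has other common aliases, such as Multi-source Information Fusion and Multi-sensor Fusion. The embodiments of this application involve two-modal fusion of the text modality and the picture modality of an article; since the text modality can be divided into a title modality and a body-text modality, three-modal fusion of the title modality, the body-text modality, and the picture modality can also be involved.
High-quality image-text content: from the perspective of the article content itself, detecting high-quality articles that balance content quality and reading experience can help the recommendation side better understand and use the articles (that is, image-text content) released from the content center. Optionally, when comprehensively evaluating the content quality of an article, modeling can be done separately along dimensions such as image-text multimodal fusion, article layout experience, and atomic account features, to finally complete the identification of high-quality articles.
Relative Position Embedding (RPE): a position encoding scheme in the Transformer model. Position encoding in the Transformer model comes in two forms: absolute position encoding and relative position encoding. Absolute position encoding is the commonly used form at present: a position vector (Position Embedding) is randomly initialized for the characters at each position, added to the input character vector (Word Embedding) sequence, fed into the model, and trained as a parameter. With absolute position encoding, characters at different positions do have different position vectors, but the relative meaning of characters at different positions cannot be obtained explicitly. For example, the distance between position 1 and position 2 is smaller than the distance between position 3 and position 10, and positions 1 and 2 and positions 3 and 4 both differ by only 1; with absolute position encoding, such relative relationships between positions can only be learned implicitly. In the embodiments of this application, introducing relative position encoding strengthens the feature representation of the relative relationships between positions.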
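Fig. 1 is a schematic diagram of an implementation environment of a data processing method provided by an embodiment of this application. Referring to Fig. 1, the implementation environment includes a terminal 110 and a server 120, both of which are examples of a computer device.
The terminal 110 is used to support a user in browsing various articles that include image-text content. For example, the articles include, but are not limited to, web news, official-account posts, blogs, and microblogs; the embodiments of this application do not specifically limit the type of article. An application that supports browsing articles is installed and runs on the terminal 110; for example, the application may be a browser application, a social application, an image-text news application, or a news viewing application, and the embodiments of this application do not specifically limit the type of application. Illustratively, the user starts the application on the terminal 110 and, through the application, can browse the high-quality articles pushed by the server 120. Optionally, the terminal 110 may be a smartphone, a tablet computer, a laptop computer, a desktop computer, a smart speaker, a smart watch, or the like, but is not limited to these.
The terminal 110 and the server 120 may be connected directly or indirectly through wired or wireless communication, which is not limited in this application.
The server 120 is used to identify and push high-quality articles, that is, the server 120 provides background services for the application installed on the terminal 110. Illustratively, the server 120 collects the articles published by creators on the platform, extracts the title, body text, and pictures of each article, determines whether the article is a high-quality article according to the corresponding title feature, body-text feature, and picture feature, and increases the recommendation weight of the identified high-quality articles in the recommendation stage, so that high-quality articles are more likely to be pushed to the terminal 110 used by the user.
Optionally, the server 120 includes at least one of a single server, multiple servers, a cloud computing platform, or a virtualization center. For example, the server 120 undertakes the primary computing work and the terminal 110 undertakes the secondary computing work; or the server 120 undertakes the secondary computing work and the terminal 110 undertakes the primary computing work; or the terminal 110 and the server 120 perform collaborative computing using a distributed computing architecture.
In some embodiments, the server may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN (Content Delivery Network), and big data and artificial intelligence platforms.
Those skilled in the art will appreciate that the number of terminals 110 may be larger or smaller. For example, there may be only one terminal 110, or there may be dozens or hundreds of terminals 110, or more. The embodiments of this application do not limit the number or device type of the terminals 110.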
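Fig. 2 is a flowchart of a data processing method provided by an embodiment of this application. Referring to Fig. 2, this embodiment is performed by a computer device and is described taking the case where the computer device is a server as an example. The embodiment includes the following steps:
201. The server obtains a text feature and a picture feature of an article, the text feature being used to represent the text data in the article, and the picture feature being used to represent the picture data in the article.
Here, the article refers to any article whose article category is to be determined. The types of article include, but are not limited to, web news, official-account posts, blogs, and microblogs; the embodiments of this application do not specifically limit the type of article.
The server is an exemplary computer device and includes at least one of a single server, multiple servers, a cloud computing platform, or a virtualization center. The embodiments of this application are described only with the computer device being a server, that is, the type identification of the article, for example identifying whether the article is a high-quality article, is completed on the server side. In some embodiments, the type identification steps can also be deployed on the terminal side, for example with the terminal independently identifying whether the article is a high-quality article.
In some embodiments, the server obtains the article. Optionally, the article may be an article stored in an article database, an article most recently uploaded to the server by a terminal, or an article downloaded from a distributed file system; the embodiments of this application do not specifically limit the source of the article whose category is to be determined.
In the scenario of identifying high-quality articles, if two articles have similar text content or similar text quality but different picture quality, the picture quality affects the determination of whether the two articles are high-quality articles. It can be seen that, besides the text modality, the picture modality, that is, the visual modality, also affects the identification result. Therefore, in the embodiments of this application, the picture modality is introduced in addition to the text modality, so that multimodal data is used comprehensively to identify high-quality articles accurately.
In some embodiments, after the article is obtained, because text and pictures have different characteristics and their feature extraction methods also differ, the server extracts the text data and the picture data of the article separately. This process can be regarded as the extraction of the multimodal data in the article. Optionally, the data may be divided into only two modalities, text and picture, which reduces the computational complexity of multimodal fusion.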
在一些实施例中,如果仅划分为文本和图片两个模态,那么服务器可以分别提取文本数据的文本语义特征和图片数据的图片深度特征,将文本语义特征与文本位置特征融合得到文本特征,将图片深度特征与图片位置特征融合得到图片特征。
在上述过程中,针对文本和图片这两种不同模态的数据,可以采用不同的方式有针对性地进行特征提取,从而得到文本特征和图片特征。并且由于文本特征和图片特征中各自融合了位置的相关信息,因此当相同文本或者图片出现在文章中的不同位置时可以表现出不同的特征,从而提高了文本特征和图片特征各自的表达能力。
在一些实施例中,由于文本中的标题和正文也具有不同的特征,因此服务器可以在提取文本数据时,进一步将标题数据和正文数据分开进行提取,以引入更多、更丰富的特征信息,即划分为标题、正文和图片三个模态,以提高整体文章类别的识别准确率。
在一些实施例中,如果划分为标题、正文和图片三个模态,那么服务器分别提取标题数据的标题语义特征、正文数据的正文语义特征和图片数据的图片深度特征,接着,将标题语义特征与标题位置特征融合得到标题特征,将正文语义特征与正文位置特征融合得到正文特征,将图片深度特征与图片位置特征融合得到图片特征。
在上述过程中,针对标题、正文、图片这三种不同模态的数据,采用不同的方式有针对性地进行特征提取,从而得到标题特征、正文特征和图片特征。并且还在标题特征、正文特征和图片特征中各自融合了位置的相关信息,使得当相同标题、正文或者图片出现在文章中的不同位置时表现出不同的特征,从而提高了标题特征、正文特征和图片特征各自的表达能力。
在一些实施例中,上述文本位置特征、图片位置特征、标题位置特征、正文位置特征等位置特征,可以均是采用绝对位置编码方式得到的绝对位置特征,能够简化多模态融合的训练流程。
在一些实施例中,上述文本位置特征、图片位置特征、标题位置特征、正文位置特征等位置特征,可以均是采用相对位置编码方式得到的相对位置特征,该相对位置特征用于表征对应的文本数据与图片数据之间的先后顺序和距离远近,或者表征对应的标题数据、正文数据与图片数据之间的先后顺序和距离远近。在下个实施例中,将对相对位置特征的相对位置编码方式进行详述,这里不做赘述。
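202. For the text feature, the server determines a first interaction feature based on the partial features of the picture feature that are associated with the text feature; the first interaction feature represents the text feature fused with the picture feature.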
在一些实施例中,针对文本和图片两模态融合的情况,服务器可以以文本模态为主模态,接收图片模态的辅助信息,例如,获取图片特征中与该文本特征关联的部分特征,并在此基础上获取到该第一交互特征。
在一些实施例中,针对标题、正文和图片三模态融合的情况,第一交互特征包括标题交互特征和正文交互特征。可选地,服务器可以以标题模态为主模态,分别接收正文模态和图片模态各自的辅助信息,例如,分别获取正文特征和图片特征中与标题特征关联的部分特征,并在此基础上获取到标题交互特征;并且,以正文模态为主模态,分别接收标题模态和图片模态各自的辅助信息,例如,分别获取标题特征和图片特征中与正文特征关联的部分特征,并在此基础上获取到正文交互特征。
203、对该图片特征,服务器基于该文本特征中与该图片特征关联的部分特征,确定第二交互特征,该第二交互特征用于表征融合了文本特征的图片特征。
在一些实施例中,针对文本和图片两模态融合的情况,服务器可以以图片模态为主模态,接收文本模态的辅助信息,例如,获取文本特征中与该图片特征关联的部分特征,并在此基础上获取到该第二交互特征。
在一些实施例中,针对标题、正文和图片三模态融合的情况,服务器以图片模态为主模态,分别接收标题模态和正文模态各自的辅助信息,例如,分别获取标题特征和正文特征中与图片特征关联的部分特征,并在此基础上获取到第二交互特征。
204、服务器将该第一交互特征与该第二交互特征融合,得到跨模态交互特征。
在一些实施例中,上述步骤204获取到的该跨模态交互特征,是指将多模态数据各自参考了其他模态数据的特征进行融合所得的信息,使得跨模态交互特征能够强化文本数据与图片数据之间的关联关系,其中,针对多模态中的每个模态,都会作为主模态接收其他模态的辅助,得到一个跨模态特征,将每个模态对应的跨模态特征融合,得到最终的跨模态交互特征。
针对文本和图片两模态融合的情况,服务器可以直接将第一交互特征和第二交互特征融合,得到最终的跨模态交互特征。
在上述过程中,对文章来说,文本模态和图片模态的表现方式不一样,因此文本特征和图片特征之间可能会存在交叉(即信息冗余)或者互补(即比单模态特征蕴含更多信息)的现象,概括来说,多模态数据的显著特点是冗余性和互补性,甚至模态间还可能存在多种不同的信息交互,因此通过以文本模态为主模态提取第一交互特征、以图片模态为主模态提取第二交互特征,能够合理处理文章中的多模态数据,得到更加丰富的交互特征。
针对标题、正文和图片三模态融合的情况,服务器可以将标题交互特征、正文交互特征和第二交互特征融合,得到最终的跨模态交互特征。
在上述过程中,对文章来说,划分了标题、正文和图片共三种模态,并且两两组合进行有向的跨模态注意力交互,每个模态都会作为主模态,接收另外两个模态的辅助信息,最终融合得到跨模态交互特征,由于比两模态融合引入了更多的模态交互信息,因此更有助于提升优质文章的识别准确率。
205、服务器基于该跨模态交互特征,确定该文章所属的文章类别。
在一些实施例中,服务器对该跨模态交互特征进行全连接处理,得到全连接特征;对该全连接特征进行指数归一化,得到该文章的概率预测结果,该概率预测结果中包括多个预测概率,这多个预测概率与多个类别一一对应,即该概率预测结果表征文章属于多个类别的多个预测概率;进而,确定符合目标条件的预测概率对应的类别,为该文章所属的文章类别。
可选地,服务器将该跨模态交互特征输入到一个全连接层或者全连接网络中,输出该全连接特征,接着利用指数归一化Softmax函数对该全连接特征进行映射,得到文章属于每个类别的预测概率。进一步地,从所有的预测概率中,选择符合目标条件的预测概率,将该符合目标条件的预测概率对应的类别确定为该文章所属的文章类别。
在一些实施例中,该目标条件可以为预测概率最大,那么服务器可以从该多个预测概率中确定最大预测概率,将该最大预测概率对应的类别确定为该文章所属的文章类别。或者,服务器可以按照从大到小的顺序对该多个预测概率进行排序,选择排序位于第一位的预测概率对应的类别为该文章所属的文章类别。
在一些实施例中,该目标条件可以为预测概率大于概率阈值,那么服务器可以从该多个预测概率中,确定大于概率阈值的各个预测概率,从大于该概率阈值的各个预测概率对应的各个类别中随机选择一个类别作为该文章所属的文章类别。其中,该预测概率为任一大于或等于0且小于或等于1的数值。
在一些实施例中,该目标条件可以为预测概率topK(K≥1)随机选择,那么服务器可以按照从大到小的顺序对该多个预测概率进行排序,选择排序位于前K位的K个预测概率,并从该K个预测概率对应的K个类别中随机选择一个类别作为该文章所属的文章类别。其中,K为大于或等于1的整数。
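The classification step described above (fully connected processing, exponential normalization, then selecting a category under a target condition) can be sketched as follows. This is a minimal illustration in plain Python, not the implementation of the application: the function names, the `condition` strings, and the default `threshold` are assumptions introduced here, and a real system would use a trained layer from a deep learning framework.

```python
import math
import random


def fully_connected(features, weights, bias):
    """One fully connected layer: logits[c] = weights[c] . features + bias[c]."""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    """Exponential normalization; subtracting the max keeps exp() stable."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pick_category(probs, categories, condition="max", threshold=0.5, k=1, rng=random):
    """Select a category under one of the three target conditions in the text."""
    # Category indices sorted by predicted probability, largest first.
    order = sorted(range(len(probs)), key=lambda c: probs[c], reverse=True)
    if condition == "max":
        return categories[order[0]]
    if condition == "threshold":
        # Random choice among categories whose probability exceeds the threshold.
        candidates = [categories[c] for c in order if probs[c] > threshold]
        return rng.choice(candidates) if candidates else None
    if condition == "topk":
        # Random choice among the K most probable categories.
        return categories[rng.choice(order[:k])]
    raise ValueError(f"unknown condition: {condition}")
```

With two categories, `pick_category(probs, ["non_high_quality", "high_quality"], "max")` returns the label with the larger predicted probability; the `"threshold"` branch returns `None` when no probability exceeds the threshold, a case the text leaves open.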
在上述过程中,基于多模态融合方式结合相对位置编码,能够识别出文章所属的文章类别,可选地,文章类别可以是按照文章是否为优质文章划分的,例如划分为:优质文章、非优质文章等;可选地,文章类别也可以是按照文章的主要内容所属的领域划分的,例如划分为:财经类、娱乐类、新闻类、科普类等,本申请实施例不对文章类别的划分方式进行具体限定。
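In one exemplary scenario, article categories are divided according to whether an article is a high-quality article, so the method is applicable to identifying high-quality articles, that is, high-quality image-text content. For example, the article categories are high-quality articles and non-high-quality articles, or high-quality articles, ordinary articles, and low-quality articles, and so on; the embodiments of this application do not specifically limit how article categories are divided. In the high-quality article identification scenario, the cross-modal interaction information between adjacent text and pictures within the same article (that is, text and pictures at nearby positions) is crucial. Text in an article is usually represented as a character sequence or a sentence sequence, and pictures can likewise be arranged in order as a picture sequence, so the text modality and the picture modality can interact at the sequence level. By constructing a sequence-level multimodal fusion network, the sequence-level interaction information between modalities can be fully used even when the text features and the picture features are not aligned; the interaction features between modalities are extracted and fed into the article category prediction process, which improves the identification accuracy for article categories.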
在一个示例性场景中,文章类别是按照文章的主要内容所属的领域划分的,那么能够适用于按照用户画像进行精准推送的场景中,例如,文章类别划分为:财经类、娱乐类、新闻类、科普类等,在这一应用场景下,可选地,服务器基于文章的跨模态交互特征与用户的用户特征之间的相似度,来确定是否向该用户推荐该文章,从而能够向用户推荐符合用户长期偏好的文章,或者,服务器基于文章的跨模态交互特征与用户的历史阅读平均特征之间的相似度,来确定是否向该用户推荐该文章,其中,该历史阅读平均特征是指用户最近一周内(或一个月、两个月等指定时间段内)阅读的历史文章的跨模态交互特征的平均特征,从而能够向用户推荐符合其近期偏好的文章。
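The similarity-based recommendation decision described above can be sketched as follows. The text does not name a similarity measure or a decision rule, so cosine similarity, the function names, and the `threshold` value are all assumptions made for illustration only.

```python
import math


def mean_vector(vectors):
    """Average feature, e.g. over the articles a user read in a recent time window."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors (assumed similarity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def should_recommend(article_feature, history_features, threshold=0.5):
    """Recommend when the article is close to the user's historical reading average.

    `threshold` is a hypothetical cut-off, not a value taken from the text.
    """
    history_average = mean_vector(history_features)
    return cosine_similarity(article_feature, history_average) >= threshold
```

Passing the user feature as a one-element `history_features` list gives the long-term-preference variant; passing the features of recently read articles gives the recent-preference variant.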
上述所有可选技术方案,能够采用任意结合形成本公开的可选实施例,在此不再一一赘述。
本申请实施例提供的方法,通过针对文章的文本数据和图片数据,分别提取文本特征和图片特征,并利用两者之间的跨模态交互特征,来预测该文章所属的文章类别,该方法同时考虑了文本模态和图片模态各自对于文章类别的贡献程度,而并非仅从文本角度来进行判断,此外所提取的跨模态交互特征并非是文本特征和图片特征的简单拼接,其能够反映出更加丰富和深层次的模态间交互信息,有助于提高对文章类别的识别准确率,进而在识别优质文章的场景下能够提高对优质文章的挖掘准确率。
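FIG. 3 is a flowchart of a data processing method provided in an embodiment of this application. Referring to FIG. 3, this embodiment is executed by a computer device and is described with the computer device being a server as an example. For the case in which an article is divided into only two modalities, text and picture, this embodiment describes in detail how to identify the article category of the article based on two-modality fusion. The embodiment includes the following steps: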
301、服务器获取文章中的文本数据和图片数据。
其中,文章是指待判断文章类别的任一文章,该文章的类型包括但不限于:网页资讯、公众号推文、博客、微博等,本申请实施例不对文章的类型进行具体限定。
在一些实施例中,服务器获取文章,可选地,该文章为文章数据库中存储的文章,或者,该文章为终端最新向服务器上传的文章,或者,该文章是从分布式文件系统中下载的文章,本申请实施例不对文章的来源进行具体限定。
在一些实施例中,由于文本和图片具有不同的特征,其特征提取方式也不尽相同,因此,服务器分别提取文章中的文本数据和图片数据。上述过程可视为针对文章中的多模态数据的提取过程,可选地,仅划分为文本和图片两个模态,能够降低多模态融合的计算复杂度。
302、服务器提取该文本数据的文本语义特征,将该文本语义特征与文本位置特征融合,得到该文本数据的文本特征。
在一些实施例中,服务器可以基于文本编码模型来提取该文本语义特征,该文本编码模型用于提取文本数据的文本语义特征,也即是说,服务器将该文本数据输入到文本编码模型中,通过该文本编码模型对该文本数据进行编码,以得到该文本语义特征。
可选地,该文本编码模型的模型结构包括但不限于下述任一项或者至少两项的组合:BERT(Bidirectional Encoder Representation From Transformers,采用双向编码表示的翻译模型)、Transformers(变换器,一种经典的翻译模型)、ELMo(Embeddings From Language Models,采用嵌入处理的语言模型)、NNLM(Neural Network Language Model,神经网络语言模型)等,本申请实施例不对该文本编码模型的模型结构进行具体限定。例如,该文本编码模型为BERT模型,以降低特征提取过程的计算复杂度,又例如,该文本编码模型由BERT模型与Transformers模型的编码器(Encoder)级联而成。
示意性地,以该文本编码模型由BERT模型与Transformers模型的编码器级联而成为例说明,假设该文本数据包括至少一个语句,服务器对每个语句进行分词处理,得到每个语句中包括的至少一个字符,将各个语句的各个字符按照其在文章中出现的先后顺序可排列形成一个字符序列,在该字符序列中在每个语句的句尾添加[SEP]作为语句分割符,并在该字符序列的首位增加[CLS]作为分类符,其中,语句分割符用于在相邻的语句之间进行断句,分类符用于表征整个字符序列的全局化语义信息。
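The construction of the character sequence described above can be sketched as follows. This is a simplified illustration that assumes character-level segmentation and skips whitespace; the function name is hypothetical, and a production system would use the tokenizer that matches its BERT checkpoint.

```python
def build_char_sequence(sentences):
    """Build one character sequence for all sentences of the text data.

    [CLS] is placed at the head as the classifier token, and [SEP] is
    appended at the end of every sentence as the sentence separator.
    """
    tokens = ["[CLS]"]
    for sentence in sentences:
        # Character-level segmentation (an assumption of this sketch).
        tokens.extend(ch for ch in sentence if not ch.isspace())
        tokens.append("[SEP]")
    return tokens
```

For two sentences `"ab"` and `"c"`, the result is `["[CLS]", "a", "b", "[SEP]", "c", "[SEP]"]`.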
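The character sequence is input to the BERT model. The BERT model includes an embedding layer and at least one bidirectional encoding layer. Each bidirectional encoding layer performs forward encoding and backward encoding on its input signal, and the output of each bidirectional encoding layer serves as the input of the next one, that is, the bidirectional encoding layers are connected in series. Each bidirectional encoding layer consists of two parts: an attention network and a feed-forward fully connected layer. In the attention network, every hidden layer is obtained as a weighted average of the hidden layers of the previous layer, so that every hidden layer is directly associated with all hidden layers of the previous layer; from the input long sequence (that is, the character sequence), a hidden vector representing global information can be obtained. The feed-forward fully connected layer further processes the global information determined by the attention network, which strengthens the learning capability of the whole BERT model.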
可选地,先将该字符序列输入到BERT模型的嵌入层中,通过该嵌入层对该字符序列中各个字符进行嵌入处理,即将各个字符映射到嵌入空间,得到各个字符的嵌入向量,即得到了一个嵌入向量序列。接着,再将该嵌入向量序列输入到该至少一个双向编码层中,通过该至少一个双向编码层对该嵌入向量序列中各个嵌入向量进行双向编码(包括正向编码和反向编码),输出各个字符的语义向量,即得到了一个语义向量序列。其中,该字符序列中的每个字符对应于该嵌入向量序列中的一个嵌入向量,该嵌入向量序列中的每个嵌入向量对应于该语义向量序列中的一个语义向量。
在上述过程中,通过双向编码层对该嵌入向量序列分别进行正向编码和反向编码,通过正向编码使得每个字符对应的语义向量能够融合该字符之前出现的字符的相关信息,通过反向编码使得每个字符对应的语义向量能够融合该字符之后出现的字符的相关信息,两个方向的编码操作能够大大提升各个字符的语义向量的表达能力。
以第一个双向编码层为例进行说明,在该双向编码层中包括注意力网络和前向全连接层。将该嵌入向量序列输入到第一个双向编码层的注意力网络中,通过注意力网络对该嵌入向量序列进行加权,以提取该嵌入向量序列的注意力特征序列,将该注意力特征序列输入到第一个双向编码层的前向全连接层中,通过前向全连接层对该注意力特征序列进行双向的语义编码(包括正向编码和反向编码),输出一个隐向量序列,将该隐向量序列输入到第二个双向编码层中,依此类推,后续的双向编码层的处理逻辑均与第一个双向编码层类似,这里不做赘述。由于在双向编码层中引入注意力机制,因此能够在每次进行语义编码时,使得各个字符聚焦于与自身关联较大(关系更密切)的字符,使得最终获取的各个字符的语义向量具有更高的准确性。
接着,将BERT模型中最后一个双向编码层输出的语义向量序列输入到Transformers模型的编码器中,Transformers模型中包括多个级联的编码器,例如,包括N(N≥1)个级联的编码器,如N=6或者其他数量,本申请实施例对此不进行具体限定。每个编码器内部又包括一个多头注意力(Multi-Head Attention)层和一个前馈神经网络(FeedForward Neural Network)层,多头注意力层用于从多个表达子空间中综合提取字符序列内各字符之间的关联关系,前馈神经网络层用于对多头注意力层输出的特征向量进行全连接,在多头注意力层和前馈神经网络层之后均设置有残差结构,也即将当前层的输入与输出进行残差连接(即拼接),并归一化之后再输入到下一层中。通过Transformers模型的多个编码器对输入的该语义向量序列进行编码,由最后一个编码器输出该文本数据的文本语义特征。
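In the above process, a text encoding model formed by cascading a BERT model with the encoder of a Transformers model can extract text semantic features with strong expressive power. Optionally, the text semantic feature may also be extracted using only the BERT model, or using only the encoder of the Transformers model, which reduces the computational complexity of extracting the text semantic feature; the embodiments of this application do not specifically limit this.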
在一些实施例中,服务器还可以获取该文本数据的文本位置特征,该文本位置特征用于表征各个字符在文本数据中的位置先后顺序。可选地,对该字符序列中各个字符的位置信息进行编码,得到该文本数据的文本位置特征。接着,将该文本语义特征和文本位置特征进行拼接(Concat),得到该文本数据的文本特征。
在一些实施例中,服务器在对各个字符的位置信息进行编码时,可以采用绝对位置编码方式或者相对位置编码方式,本申请实施例对位置信息的编码方式不进行具体限定。示意性地,在下述步骤303中将以使用相对位置编码方式为例进行说明,这里不做赘述。
在一些实施例中,如果该文本语义特征与文本位置特征的维度不同,那么文本语义特征和文本位置特征将无法直接拼接,此时可以使用一个1维卷积层对该文本语义特征进行维度变换(即升维或者降维),使得维度变换后的文本语义特征与文本位置特征维度相同,从而将维度变换后的文本语义特征与文本位置特征拼接,得到该文本数据的文本特征。其中,该1维卷积层是指卷积核尺寸为1×1的卷积层。
在一些实施例中,除了以拼接方式进行融合之外,也可采用按元素相加、按元素相乘、双线性汇合等方式,来融合该文本语义特征和文本位置特征,本申请实施例不对融合方式进行具体限定。
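The dimension transformation and the fusion options described above can be sketched on plain lists. A convolution with a 1×1 kernel applied at a single sequence position reduces to a linear map over the channel dimension, which is what `project` does here. The function names and the `mode` strings are assumptions of this sketch, and bilinear pooling is omitted for brevity.

```python
def project(vector, weight):
    """Dimension transformation: a 1x1 convolution at one position is a
    linear map over channels; `weight` has shape (out_dim, in_dim)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weight]


def fuse(semantic, position, mode="concat"):
    """Fuse a semantic feature with a position feature of the same dimension."""
    if len(semantic) != len(position):
        raise ValueError("apply project() first so that both dimensions match")
    if mode == "concat":
        return semantic + position              # concatenation
    if mode == "add":
        return [s + p for s, p in zip(semantic, position)]   # element-wise addition
    if mode == "mul":
        return [s * p for s, p in zip(semantic, position)]   # element-wise multiplication
    raise ValueError(f"unknown fusion mode: {mode}")
```

For example, a 3-dimensional semantic feature can be projected to 2 dimensions and then fused with a 2-dimensional position feature in any of the three modes.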
303、服务器提取该图片数据的图片深度特征,将该图片深度特征与图片位置特征融合,得到该图片数据的图片特征。
在一些实施例中,服务器可以基于图片深度模型来提取该图片深度特征,该图片深度模型用于提取图片数据的图片深度特征,也即是说,服务器将该图片数据输入到图片深度模型中,通过该图片深度模型对该图片数据进行卷积处理,以提取得到该图片深度特征。可选地,该图片深度模型包括但不限于:卷积神经网络(Convolutional Neural Networks,CNN)、深度残差网络(ResNet)、MobileNet(一种轻量级神经网络)等,本申请实施例不对该图片深度模型的模型结构进行具体限定。
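Illustratively, the picture depth model may be a MobileNet model. The MobileNet model replaces the standard convolution layers of a VGG (Visual Geometry Group) model with depthwise separable convolution layers. Depthwise separable convolution is a factorized convolution operation that can be decomposed into a depthwise convolution and a pointwise convolution. Depthwise convolution differs from standard convolution: the kernels of a standard convolution are applied across all channels of the input feature map, whereas a depthwise convolution uses a different kernel for each input channel, that is, one kernel corresponds to one input channel. The pointwise convolution is a standard convolution whose kernel size is 1×1. Using depthwise separable convolution greatly reduces the computation and the number of parameters of the picture depth model.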
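The parameter saving of depthwise separable convolution can be checked with a short calculation. The sketch below counts weights only (bias terms are ignored), and the layer sizes in the usage note are illustrative values chosen here, not values from the application.

```python
def standard_conv_params(kernel, in_channels, out_channels):
    """Every output channel has one kernel x kernel filter per input channel."""
    return kernel * kernel * in_channels * out_channels


def depthwise_separable_params(kernel, in_channels, out_channels):
    """Depthwise part: one kernel x kernel filter per input channel.
    Pointwise part: a 1x1 convolution mixing in_channels into out_channels."""
    depthwise = kernel * kernel * in_channels
    pointwise = in_channels * out_channels
    return depthwise + pointwise
```

For a 3×3 layer with 32 input channels and 64 output channels, the standard convolution has 18432 weights and the depthwise separable version has 2336, roughly an eightfold reduction.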
在一些实施例中,MobileNet模型包括一个3×3的标准卷积层、堆积的多个深度可分离卷积层、一个均值池化层和一个全连接层,在标准卷积层后可进行下采样再输入到下一层中,同理,在部分深度可分离卷积层后可进行下采样再输入到下一层中,该均值池化层用于将最后一个深度可分离卷积层输出的特征图进行均值池化,该全连接层用于对均值池化层输出的特征图进行全连接。示意性地,MobileNet模型总共包括有28层,其中深度可分离卷积层有13层。
在一些实施例中,由于文章中通常包括多个图片,因此该多个图片可构成一个图片序列,服务器将该图片序列输入到MobileNet模型中,通过标准卷积层对该图片序列进行标准卷积操作,得到第一特征图,将该第一特征图输入到级联的多个深度可分离卷积层中,每个深度可分离卷积层对上一层输出的特征图进行深度可分离卷积操作,最后一个深度可分离卷积层输出第二特征图,将该第二特征图输入到均值池化层中,通过该均值池化层对第二特征图进行均值池化操作,得到第三特征图,将该第三特征图输入到全连接层中,通过该全连接层对该第三特征图进行全连接,得到该图片深度特征。
在一些实施例中,服务器还可以获取该图片数据的图片位置特征,该图片位置特征用于表征各个图片在图片数据中的位置先后顺序。可选地,对该图片序列中各个图片的位置信息进行编码,得到该图片数据的图片位置特征。接着,将该图片深度特征和图片位置特征拼接(Concat),得到该图片数据的图片特征。
在一些实施例中,服务器在对各个图片的位置信息进行编码时,可以采用绝对位置编码方式或者相对位置编码方式,本申请实施例对位置信息的编码方式不进行具体限定。
示意性地,以使用相对位置编码方式为例进行说明,在这种情况下,该文本位置特征和该图片位置特征均为该文本数据与该图片数据之间的相对位置特征,该相对位置特征用于表征该文本数据与该图片数据之间的先后顺序和距离远近。
在一些实施例中,该相对位置特征的获取方式包括:确定文本数据中的多个文本、以及图片数据中的多个图片各自在该文章中的位置信息;基于该位置信息,构建相对位置编码矩阵,该相对位置编码矩阵中的任一元素用于表征该元素所属列对应的文本和该元素所属行对应的图片之间的相对位置信息;基于该相对位置编码矩阵,确定该多个文本中的任一文本与该多个图片中的任一图片之间的相对位置特征。
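FIG. 4 is a schematic diagram of the principle of extracting position information provided in an embodiment of this application. As shown at 400, assume that an article includes 5 text segments and 4 pictures, where the position numbers of the text sequence formed by the 5 text segments are {1, 3, 5, 7, 8} and the position numbers of the picture sequence formed by the 4 pictures are {0, 2, 4, 6}. The absolute position relationship extracted for the article can then be expressed as: pos-0(img), pos-1(text), pos-2(img), pos-3(text), pos-4(img), pos-5(text), pos-6(img), pos-7(text), pos-8(text). Illustratively, based on this absolute position relationship, with the text sequence as the matrix columns and the picture sequence as the matrix rows, the relative position encoding matrix shown in Table 1 below is constructed:
Table 1 (rows: picture positions; columns: text positions)
          1     3     5     7     8
    0     1     3     5     7     8
    2    -1     1     3     5     6
    4    -3    -1     1     3     4
    6    -5    -3    -1     1     2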
其中,相对位置编码矩阵中的每个元素用于表征该元素所属列对应的文本和该元素所属行对应的图片之间的相对位置信息。例如,相对位置编码矩阵中第2行第3列的元素“3”代表了第3列所对应的文本“5”与第2行所对应的图片“2”之间的相对位置信息:3=5-2。
在构建出相对位置编码矩阵之后,通过该相对位置编码矩阵,可以确定出每个文本与每个图片之间的相对位置信息,对该相对位置信息进行编码即可得到对应的相对位置特征。
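The construction of the relative position encoding matrix can be sketched as follows, using the example of FIG. 4. The function names and the `layout` representation are assumptions of this sketch; each matrix element is the text position minus the picture position, consistent with the example 3 = 5 - 2 given above.

```python
def split_positions(layout):
    """Split an article layout such as ["img", "text", ...] into the absolute
    position numbers of the text sequence and of the picture sequence."""
    text_positions = [p for p, kind in enumerate(layout) if kind == "text"]
    image_positions = [p for p, kind in enumerate(layout) if kind == "img"]
    return text_positions, image_positions


def relative_position_matrix(text_positions, image_positions):
    """Rows follow the picture sequence, columns follow the text sequence;
    each element is (text position - picture position)."""
    return [[t - i for t in text_positions] for i in image_positions]


layout = ["img", "text", "img", "text", "img", "text", "img", "text", "text"]
texts, images = split_positions(layout)        # [1, 3, 5, 7, 8] and [0, 2, 4, 6]
matrix = relative_position_matrix(texts, images)
# matrix[1][2] is the entry for picture position 2 and text position 5: 5 - 2 = 3
```

A positive entry means the text appears after the picture, a negative entry means it appears before, and the absolute value gives the distance between them.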
在上述过程中,通过使用相对位置特征,可以在文本特征和图片特征均引入显式地相对位置信息,从而能够提升文本特征和图片特征各自的表达能力。
在一些实施例中,使用传统的绝对位置编码方式确定的文本位置特征和图片位置特征均属于绝对位置特征,能够隐式的学习到不同文本段落和不同文章插图在位置上的相关性,充分考虑文本序列和图片序列的模态内部位置关系。
在一些实施例中,如果该图片深度特征与图片位置特征的维度不同,那么图片深度特征和图片位置特征将无法直接拼接,此时可以使用一个1维卷积层对该图片深度特征进行维度变换(即升维或者降维),使得维度变换后的图片深度特征与图片位置特征维度相同,进而将维度变换后的图片深度特征与图片位置特征拼接,得到该图片数据的图片特征。其中,该1维卷积层是指卷积核尺寸为1×1的卷积层。
在一些实施例中,除了可以采用拼接方式融合之外,也可采用按元素相加、按元素相乘、双线性汇合等方式,来融合该图片深度特征和图片位置特征,本申请实施例不对融合方式进行具体限定。
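Steps 302-303 above provide a possible implementation in which, for two-modality fusion of text and picture, the server obtains the text feature and the picture feature of the article, where the text feature represents the text data in the article and the picture feature represents the picture data in the article. The next embodiment describes how, for three-modality fusion of title, body, and picture, the server obtains the title feature of the title data, the body feature of the body data, and the picture feature of the picture data; details are not repeated here.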
应理解,在实际应用中,可以先执行步骤302、后执行步骤303,也可以先执行步骤303、后执行步骤302,还可以同时执行步骤302和步骤303,本申请在此不对步骤302和步骤303的执行顺序做任何限定。
304、对该文本特征,服务器基于该图片特征中与该文本特征关联的部分特征,确定第一交互特征,该第一交互特征用于表征融合了图片特征的文本特征。
在一些实施例中,以文本模态为主模态,接收图片模态的辅助信息,也即是说,服务器获取图片特征中与文本特征关联的部分特征,并利用跨模交互模型,对该文本特征和该部分特征进行处理,得到该第一交互特征。可选地,该跨模交互模型包括但不限于:Transformers模型或者Transformers模型的变种。
图5是本申请实施例提供的一种跨模交互模型的原理性示意图,如500所示,以跨模交互模型为Cross-modal(跨模)Transformers模型为例进行说明,跨模Transformers模型包括D+1(D≥0)个跨模交互层,假设α模态为主模态(例如文本模态),β模态为辅模态(例如图片模态),那么从β模态到α模态(β→α)的交互特征的提取过程如下:
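The feature $Z_{\alpha}^{[0]}$ of modality α at time t=0 and the feature $Z_{\beta}^{[0]}$ of modality β at time t=0 are input to layer 0 of the cross-modal Transformers model.
Because every layer of the cross-modal Transformers model processes the features of the two input modalities in a similar way, the processing logic of layer i is taken as an example, where i is any integer greater than or equal to 0 and less than or equal to D.
The input signals of layer i include the interaction feature $Z_{\beta\to\alpha}^{[i-1]}$ from modality β to modality α at time t=i-1 and the feature $Z_{\beta}^{[0]}$ of modality β at time t=0.
The feature $Z_{\beta\to\alpha}^{[i-1]}$ is input to a layer normalization (LayerNorm, LN) layer for normalization, giving the feature $Q_{\alpha}$; likewise, the feature $Z_{\beta}^{[0]}$ is input to another LN layer for normalization, giving the features $K_{\beta}$ and $V_{\beta}$. The features $Q_{\alpha}$, $K_{\beta}$, and $V_{\beta}$ are input to a multi-head attention (Multi-Head) layer, which weights them based on the attention mechanism and extracts the cross-modal feature $\mathrm{CM}_{\beta\to\alpha}(Q_{\alpha}, K_{\beta}, V_{\beta})$ between the input signals.
The cross-modal feature $\mathrm{CM}_{\beta\to\alpha}(Q_{\alpha}, K_{\beta}, V_{\beta})$ and the feature $Q_{\alpha}$ of the input signal are added element-wise (Addition) to obtain a fused feature. The fused feature is input to another LN layer for normalization, and the normalized fused feature is input to a position-wise feed-forward network layer for full connection. The feature output by the position-wise feed-forward network layer is added element-wise to the fused feature that was input to it, giving the output feature of layer i, that is, the interaction feature $Z_{\beta\to\alpha}^{[i]}$ from modality β to modality α at time t=i.
By analogy, layer D finally outputs the final interaction feature $Z_{\beta\to\alpha}^{[D]}$ (that is, the first interaction feature). The two element-wise additions are equivalent to residual connections over the input and output of the multi-head attention layer and of the position-wise feed-forward network layer, respectively.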
在上述过程中,跨模Transformers模型接收主模态和辅模态各自的序列特征作为输入信号,经过多头注意力层和位置全连接前馈网络层这两级的处理,最终输出融合了辅模态信息的主模态表征(即第一交互特征),该多头注意力层可视为对自注意力(Self-Attention)层的改造,将输入特征的K和V改为辅模态β的特征序列K β和V β,而特征Q则是主模态α的特征序列Q α,利用主模态来挑选辅模态中与自身存在交互关系的辅模态信息,使得提取到的第一交互特征具有更强的特征表达能力。进一步地,跨模Transformers模型的结构具有通用性和灵活性,在模型设计时可根据模态的重要性进行定制化组合,并且,多头注意力层中跨模态的注意力机制是有向的,即对于同一对输入模态{文本,图片}来说,以文本为主模态和以图片为主模态所提取到的交互特征是不同的,比如本步骤304以文本为主模态时提取到第一交互特征,下述步骤305以图片为主模态时提取到第二交互特征,该第一交互特征与第二交互特征是不同的,这样有助于模型更加充分地利用模态间的交互信息,此外,跨模Transformers模型中利用多个跨模交互层的堆叠,相较于传统的单层交互方案可以融合更多的高阶交互信息。
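The directed cross-modal attention described above, in which the query comes from the primary modality α and the key and value come from the auxiliary modality β, can be sketched as a single attention head in plain Python. This is a deliberately reduced illustration: layer normalization, multiple heads, the residual additions, and the feed-forward layer are omitted, and all names are assumptions of this sketch.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(scores):
    """Exponential normalization applied to every row."""
    result = []
    for row in scores:
        peak = max(row)
        exps = [math.exp(s - peak) for s in row]
        total = sum(exps)
        result.append([e / total for e in exps])
    return result


def cross_modal_attention(z_alpha, z_beta, w_q, w_k, w_v):
    """Single-head attention from auxiliary modality beta to primary modality alpha."""
    q = matmul(z_alpha, w_q)            # queries from the primary modality
    k = matmul(z_beta, w_k)             # keys from the auxiliary modality
    v = matmul(z_beta, w_v)             # values from the auxiliary modality
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / scale for s in row] for row in matmul(q, k_t)]
    weights = softmax_rows(scores)      # one weight per (alpha item, beta item) pair
    return matmul(weights, v), weights
```

The output has one row per element of the primary sequence, so the two sequences may have different lengths; swapping the roles of `z_alpha` and `z_beta` generally yields a different result, which reflects the directed nature of the attention.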
在上述过程中,直接由跨模Transformers模型输出第一交互特征,能够降低获取第一交互特征时的计算复杂度。在一些实施例中,还可以将跨模Transformers模型输出的特征作为中间交互特征,进而再将该中间交互特征输入一个基础的Transformers模型中进行先编码再解码,最终由基础的Transformers模型输出该第一交互特征。
可选地,服务器将该中间交互特征输入Transformers模型,Transformers模型包括N个级联的编码器和N个级联的解码器,调用该N个级联的编码器对该中间交互特征进行编码,将编码得到的特征输入到N个级联的解码器中进行解码,得到该第一交互特征。其中,N为大于或等于1的整数,例如N=6或者其他数值。
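In some embodiments, each of the N cascaded encoders internally includes a multi-head attention layer and a feed-forward neural network layer. The multi-head attention layer extracts, from multiple representation subspaces, the relationships between the feature vectors at different times, and the feed-forward neural network layer applies full connection to the feature vectors output by the multi-head attention layer. A residual structure follows both the multi-head attention layer and the feed-forward neural network layer, that is, the input and output of the current layer are combined through a residual connection (that is, concatenated) and normalized before being input to the next layer. The N cascaded encoders encode the input vectors, and the feature output by the last encoder is input to the N cascaded decoders.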
N个级联的解码器中每个解码器内部包括一个掩码多头注意力层、一个融合多头注意力层和一个前馈神经网络层,掩码多头注意力层与多头注意力层类似,但掩码多头注意力层仅关注当前时刻之前的翻译结果,因此需要对当前时刻之后的翻译结果进行mask(遮挡)处理,而融合多头注意力层也与多头注意力层也类似,但融合多头注意力层除了以本解码器的掩码多头注意力层的输出为输入之外,还以对应序号的编码器的前馈神经网络层的输出(指经过残差连接及归一化的结果)作为输入,这一设计是为了关注编码器的编码信息,换一种说法,解码器通过查看编码器的输出和对其自身输出的自注意力,来预测下一个时刻的交互特征,解码器的前馈神经网络层与编码器的前馈神经网络层类似,这里不做赘述,同理解码器的掩码多头注意力层、融合多头注意力层、前馈神经网络层之后也均设置有残差结构,也即将当前层的输入与输出进行残差连接(即拼接)并归一化之后再输入到下一层中。其中,级联的编码器的数量与级联的解码器的数量需要保持一致。通过N个级联的解码器可以对编码得到的特征进行解码,由最后一个解码器输出第一交互特征。
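The mask processing of the masked multi-head attention layer can be sketched as follows: positions after the current time step are set to negative infinity before exponential normalization, so they receive zero attention weight. The function names are assumptions of this sketch, which covers only the masking step, not a full decoder.

```python
import math


def causal_mask(length):
    """mask[i][j] is True when position j is visible from position i (j <= i)."""
    return [[j <= i for j in range(length)] for i in range(length)]


def masked_softmax(scores, mask):
    """Row-wise exponential normalization that ignores masked-out positions."""
    weights = []
    for score_row, mask_row in zip(scores, mask):
        kept = [s if keep else float("-inf") for s, keep in zip(score_row, mask_row)]
        peak = max(kept)                          # the diagonal is always visible
        exps = [math.exp(s - peak) for s in kept]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights
```

With uniform scores over three positions, the first row attends only to position 0, the second row splits its weight over positions 0 and 1, and the third row over all three positions.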
305、对该图片特征,服务器基于该文本特征中与该图片特征关联的部分特征,确定第二交互特征,该第二交互特征用于表征融合了文本特征的图片特征。
在一些实施例中,以图片模态为主模态,接收文本模态的辅助信息,也即是说,服务器获取文本特征中与图片特征关联的部分特征,并利用跨模交互模型,对该图片特征和该部分特征进行处理,得到该第二交互特征。可选地,该跨模交互模型包括但不限于:Transformers模型或者Transformers模型的变种。
上述步骤305与上述步骤304类似,只是将主模态α变更为图片模态,将辅模态β变更为文本模态,这里不做赘述。
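In the above process, having the cross-modal Transformers model output the second interaction feature directly reduces the computational complexity of obtaining the second interaction feature. In some embodiments, the feature output by the cross-modal Transformers model is used as an intermediate interaction feature, which is input to a basic Transformers model for encoding followed by decoding, and the basic Transformers model finally outputs the second interaction feature. Optionally, the server inputs the intermediate interaction feature to the Transformers model, which includes N cascaded encoders and N cascaded decoders; the N cascaded encoders are invoked to encode the intermediate interaction feature, and the encoded feature is input to the N cascaded decoders for decoding to obtain the second interaction feature. Here, N is an integer greater than or equal to 1, for example N=6 or another value. The internal processing logic of each encoder and decoder in the basic Transformers model has been described in step 304 above and is not repeated here.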
应理解,在实际应用中,可以先执行步骤304、后执行步骤305,也可以先执行步骤305、后执行步骤304,还可以同时执行步骤304和步骤305,本申请在此不对步骤304和步骤305的执行顺序做任何限定。
306、服务器将该第一交互特征与该第二交互特征融合,得到跨模态交互特征。
在一些实施例中,服务器可以将该第一交互特征和该第二交互特征进行拼接,得到最终的跨模态交互特征,从而降低特征融合时的计算量。
在另一些实施例中,服务器可以将该第一交互特征和该第二交互特征进行按元素相加、按元素相乘或者双线性汇合,从而使得特征融合得更加充分,本申请实施例不对特征融合方式进行具体限定。
在上述步骤304-306中,示出了在文本与图片两模态融合的情况下,服务器获取两个模态之间的跨模态交互特征的一种可能实施方式,在下个实施例中将针对标题、正文和图片三模态融合的情况,介绍服务器如何获取三个模态之间的跨模态交互特征,这里不做赘述。
307. The server determines, based on the cross-modal interaction feature, the article category to which the article belongs.
上述步骤307与上述步骤205类似,这里不做赘述。
图6是本申请实施例提供的一种结合相对位置编码的多模态融合网络的原理性示意图,如图6所示,多模态融合网络中包括文本编码模型601、图片编码模型602和跨模交互部分603。
示意性地,文本编码模型601可以由基础BERT模型微调(Finetune)得到的BERT模型6011和Transformers模型的编码器6012级联而成,将文本数据的字符序列(简称为文本句子序列)输入BERT模型6011,输出一个语义向量序列,将该语义向量序列再输入到Transformers模型的编码器6012,输出文本数据的文本语义特征,将文本语义特征输入一个1维卷积层(Conv1D)层进行维度变换后,与文本位置特征进行拼接,得到文本数据的文本特征。
示意性地,图片编码模型602为预训练得到的MobileNet模型,将图片数据的图片序列输入图片编码模型602,输出图片数据的图片深度特征,将图片深度特征输入Conv1D层进行维度变换后,与图片位置特征进行拼接,得到该图片数据的图片特征。
示意性地,跨模交互部分603包括2个跨模Transformers模型和2个基础Transformers模型。以文本模态为主模态,利用跨模Transformers模型提取从图片模态→文本模态的中间交互特征,将该中间交互特征输入基础Transformers模型进行先编码再解码,输出第一交互特征。以图片模态为主模态,利用跨模Transformers模型提取从文本模态→图片模态的中间交互特征,将该中间交互特征输入基础Transformers模型进行先编码再解码,输出第二交互特征。
进一步地,将该第一交互特征和该第二交互特征拼接,得到最终两模态间的跨模态交互特征,再利用该跨模态交互特征,预测得到文章最终所属的文章类别(Classification)。
在一些实施例中,如采用上述步骤303中介绍的相对位置编码方式,那么就需要将跨模交互部分603中各个Transformers模型的绝对位置特征修改为相对位置特征,例如,分离原始的字符Embedding(嵌入向量)和位置Embedding(位置向量),展开分列式后,将绝对位置编码方式的位置向量转换为相对位置编码方式的位置向量,即实现在任意两模态进行交互计算时将相对位置关系融入到自注意力层中。
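In the Transformers model, the self-attention layer is usually expressed as:
$$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$
where Attention(Q, K, V) is the attention coefficient computed from the Q (Query) matrix, the K (Key) matrix, and the V (Value) matrix; softmax() is the exponential normalization function; Q, K, and V are the Q matrix, K matrix, and V matrix of the current character; $K^{T}$ is the transpose of the K matrix; and $\sqrt{d_k}$ is the scaling factor, with $d_k$ the dimension of the key vectors.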
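Expanding with the distributive law, under absolute position encoding the attention coefficient $A_{i,j}^{\mathrm{abs}}$ between the i-th element of modality 1 and the j-th element of modality 2 is:
$$A_{i,j}^{\mathrm{abs}} = E_{x_i}^{T} W_q^{T} W_k E_{x_j} + E_{x_i}^{T} W_q^{T} W_k U_j + U_i^{T} W_q^{T} W_k E_{x_j} + U_i^{T} W_q^{T} W_k U_j$$
where E denotes a text vector, U denotes a position vector, and W denotes a parameter matrix.
That is, $E_{x_i}^{T}$ is the transpose of the text vector of the i-th element of modality 1, $W_q^{T}$ is the transpose of the parameter matrix of the Q matrix, $W_k$ is the parameter matrix of the K matrix, $E_{x_j}$ is the text vector of the j-th element of modality 2, $U_j$ is the position vector of the j-th element of modality 2, and $U_i^{T}$ is the transpose of the position vector of the i-th element of modality 1.
Replacing the position vectors $U_j$ and $U_i$ of absolute position encoding with the position vector $R_{i-j}$ of relative position encoding transforms the expansion into:
$$A_{i,j}^{\mathrm{rel}} = E_{x_i}^{T} W_q^{T} W_{k,E} E_{x_j} + E_{x_i}^{T} W_q^{T} W_{k,R} R_{i-j} + u^{T} W_{k,E} E_{x_j} + v^{T} W_{k,R} R_{i-j}$$
where E denotes a text vector, R denotes a relative position encoding vector, and W denotes a parameter matrix.
That is, $E_{x_i}^{T}$ is the transpose of the text vector of the i-th element of modality 1, $W_q^{T}$ is the transpose of the parameter matrix of the Q matrix, $W_{k,E}$ is the parameter matrix related to the K matrix and the text vector under relative position encoding, $E_{x_j}$ is the text vector of the j-th element of modality 2, $R_{i-j}$ is the relative position encoding vector between the i-th element of modality 1 and the j-th element of modality 2, $W_{k,R}$ is the parameter matrix related to the K matrix and the relative position encoding vector under relative position encoding, and $u^{T}$ and $v^{T}$ are learnable parameter vectors that are independent of the position of the i-th element of modality 1.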
上述所有可选技术方案,能够采用任意结合形成本公开的可选实施例,在此不再一一赘述。
本申请实施例提供的方法,通过针对文章的文本数据和图片数据,分别提取文本特征和图片特征,并利用两者之间的跨模态交互特征,来预测该文章所属的文章类别,该方法同时考虑了文本模态和图片模态各自对于文章类别的贡献程度,而并非仅从文本角度来进行判断,此外所提取的跨模态交互特征并非是文本特征和图片特征的简单拼接,其能够反映出更加丰富和深层次的模态间交互信息,有助于提高对文章类别的识别准确率,进而在识别优质文章的场景下能够提高对优质文章的挖掘准确率。
图7是本申请实施例提供的一种数据处理方法的流程图。参见图7,该实施例由计算机设备执行,以计算机设备为服务器为例进行说明,针对文章划分了标题、正文和图片三个模态的情况,在本申请实施例中将详细介绍如何基于三模态融合方式识别文章的文章类别,该实施例包括下述步骤:
701、服务器获取文章中的标题数据、正文数据和图片数据。
其中,该标题数据和正文数据可统称为文本数据。
上述步骤701与上述步骤301类似,这里不做赘述。可选地,服务器获取到文本数据和图片数据之后,可以进一步从该文本数据中抽取标题数据和正文数据。
702、服务器提取该标题数据的标题语义特征,将该标题语义特征与标题位置特征融合,得到该标题数据的标题特征。
在一些实施例中,服务器基于标题编码模型来提取该标题语义特征,该标题编码模型用于提取标题数据的标题语义特征,也即是说,服务器将该标题数据输入到标题编码模型中,通过该标题编码模型对该标题数据进行编码,以提取得到该标题语义特征。可选地,该标题编码模型的模型结构包括但不限于:BERT模型、Transformers模型、ELMo模型、NNLM模型等,本申请实施例不对该标题编码模型的模型结构进行具体限定。
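Illustratively, take the title encoding model being a BERT model as an example. Assume the title data includes at least one title. The server may segment each title to obtain the at least one character contained in it, and arrange the characters of all titles in the order in which they appear in the article to form a character sequence. In the character sequence, [SEP] is appended at the end of each title as a sentence separator, and [CLS] is added at the head of the character sequence as a classifier token. The sentence separator marks the boundary between adjacent titles, and the classifier token represents the global semantic information of the whole character sequence.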
将该字符序列输入到BERT模型中,BERT模型包括一个嵌入层和至少一个双向编码层,每个双向编码层用于对输入信号进行正向编码和反向编码,每个双向编码层的输出作为下一个双向编码层的输入,即各个双向编码层之间串联连接。在每个双向编码层中包括两部分,一部分是注意力网络,另一部分是前向全连接层,注意力网络中每一个隐层都是由上一层的隐层进行加权平均所得,使得每一个隐层都能和上一层的所有隐层直接关联,利用输入的长序列信息(也即该字符序列)能够得到一个用于表征全局化信息的隐层向量,而前向全连接层则用于对注意力网络获取的全局化信息进行进一步加工,以增强整个BERT模型的学习能力。
可选地,可以先将该字符序列输入到BERT模型的嵌入层中,通过该嵌入层对该字符序列中各个字符进行嵌入处理,换言之,将各个字符映射到嵌入空间,得到各个字符的嵌入向量,即得到了一个嵌入向量序列。接着,再将该嵌入向量序列输入到该至少一个双向编码层中,通过该至少一个双向编码层对该嵌入向量序列中各个嵌入向量进行双向编码(包括正向编码和反向编码),输出各个字符的语义向量,即得到了一个语义向量序列,最终,由最后一个双向编码层输出该标题数据的标题语义特征。该字符序列中的每个字符对应于该嵌入向量序列中的一个嵌入向量,该嵌入向量序列中的每个嵌入向量对应于该语义向量序列中的一个语义向量。
在上述过程中,通过双向编码层对该嵌入向量序列分别进行正向编码和反向编码,通过正向编码使得每个字符对应的语义向量能够融合该字符之前出现的字符的相关信息,而通过反向编码使得每个字符对应的语义向量能够融合该字符之后出现的字符的相关信息,两个方向的编码操作能够大大提升各个字符的语义向量的表达能力。
以第一个双向编码层为例进行说明,在该双向编码层中包括注意力网络和前向全连接层。将该嵌入向量序列输入到第一个双向编码层的注意力网络中,通过注意力网络对该嵌入向量序列进行加权,以提取该嵌入向量序列的注意力特征序列,将该注意力特征序列输入到第一个双向编码层的前向全连接层中,通过前向全连接层对该注意力特征序列进行双向的语义编码(包括正向编码和反向编码),输出一个隐向量序列,将该隐向量序列输入到第二个双向编码层中,依此类推,后续的双向编码层的处理逻辑均与第一个双向编码层类似,这里不做赘述,最终,由最后一个双向编码层输出该标题数据的标题语义特征。由于在双向编码层中引入注意力机制,能够在每次进行语义编码时,使得各个字符聚焦于与自身关联较大(关系更密切)的字符,使得最终获取的各个字符的语义向量具有更高的准确性。
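In some embodiments, the server may also obtain the title position feature of the title data, which represents the positional order of the characters in the title data. Optionally, the position information of each character in the character sequence is encoded to obtain the title position feature of the title data. The title semantic feature and the title position feature are then concatenated to obtain the title feature of the title data.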
在一些实施例中,服务器在对各个字符的位置信息进行编码时,可以采用绝对位置编码方式或者相对位置编码方式,本申请实施例对位置信息的编码方式不进行具体限定。两种位置编码方式均在上个实施例中已介绍过,这里不做赘述。
在一些实施例中,如果该标题语义特征与标题位置特征的维度不同,那么标题语义特征和标题位置特征将无法直接拼接,此时可以使用一个1维卷积层对该标题语义特征进行维度变换(即升维或者降维),使得维度变换后的标题语义特征与标题位置特征维度相同,从而将维度变换后的标题语义特征与标题位置特征进行拼接,得到该标题数据的标题特征。其中,该1维卷积层是指卷积核尺寸为1×1的卷积层。
在一些实施例中,除了可以采用拼接方式进行融合之外,也可利用按元素相加、按元素相乘、双线性汇合等方式,来融合该标题语义特征和标题位置特征,本申请实施例不对融合方式进行具体限定。
703、服务器提取该正文数据的正文语义特征,将该正文语义特征与正文位置特征融合,得到该正文数据的正文特征。
在一些实施例中,服务器基于正文编码模型来提取该正文语义特征,该正文编码模型用于提取正文数据的正文语义特征,也即是说,服务器将该正文数据输入到正文编码模型中,通过该正文编码模型对该正文数据进行编码,以提取得到该正文语义特征。可选地,该正文编码模型的模型结构包括但不限于下述任一项或者至少两项的组合:BERT模型、Transformers模型、ELMo模型、NNLM模型等,本申请实施例不对该正文编码模型的模型结构进行具体限定。示意性地,该正文编码模型可以由BERT模型与Transformers模型的编码器级联而成,此种结构的正文编码模型对正文数据的处理过程与上述步骤302中文本编码模型对文本数据的处理过程类似,这里不做赘述。
在一些实施例中,服务器还可以获取该正文数据的正文位置特征,该正文位置特征用于表征各个字符在正文数据中的位置先后顺序。可选地,对该字符序列中各个字符的位置信息进行编码,得到该正文数据的正文位置特征。接着,将该正文语义特征和正文位置特征进行拼接,得到该正文数据的正文特征。
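In some embodiments, when encoding the position information of each character, the server may use absolute position encoding or relative position encoding; the embodiments of this application do not specifically limit the encoding of position information. Both position encoding methods have been described in the previous embodiment and are not repeated here.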
在一些实施例中,如果该正文语义特征与正文位置特征的维度不同,那么正文语义特征和正文位置特征将无法直接拼接,此时可以使用一个1维卷积层对该正文语义特征进行维度变换(即升维或者降维),使得维度变换后的正文语义特征与正文位置特征维度相同,从而将维度变换后的正文语义特征与正文位置特征进行拼接,得到该正文数据的正文特征。其中,该1维卷积层是指卷积核尺寸为1×1的卷积层。
在一些实施例中,除了可以采用拼接方式进行融合之外,也可利用按元素相加、按元素相乘、双线性汇合等方式,来融合该正文语义特征和正文位置特征,本申请实施例不对融合方式进行具体限定。
在上述步骤702-703中,提供了在将文本数据划分为标题数据和正文数据的情况下,服务器提取该文本数据的文本语义特征,将该文本语义特征与文本位置特征进行融合,得到该文本数据的文本特征的可能实施方式,通过将文本数据划分为标题数据和正文数据,能够提取出更多、更丰富的特征信息。
704、服务器提取该图片数据的图片深度特征,将该图片深度特征与图片位置特征融合,得到该图片数据的图片特征。
上述步骤704与上述步骤303类似,这里不做赘述。
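It should be understood that, in practical applications, the execution order of steps 702, 703, and 704 may be adjusted as needed; this application does not limit the execution order of steps 702, 703, and 704 in any way.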
705、对该标题特征,服务器基于该正文特征和该图片特征中分别与该标题特征关联的部分特征,确定标题交互特征,该标题交互特征用于表征融合了正文特征和图片特征之后的标题特征。
在一些实施例中,服务器基于该正文特征中与该标题特征关联的部分特征,确定第一标题交互特征,也即是说,以标题模态为主模态,接收正文模态的辅助信息。可选地,服务器获取正文特征中与标题特征关联的部分特征,并利用跨模交互模型,对该标题特征和该部分特征进行处理,得到该第一标题交互特征。可选地,该跨模交互模型包括但不限于:Transformers模型或者Transformers模型的变种,例如,该跨模交互模型可以为跨模Transformers模型,将主模态α确定为标题模态,将辅模态β确定为正文模态,模型结构和处理逻辑与上述步骤304类似,这里不做赘述。
在一些实施例中,服务器基于该图片特征中与该标题特征关联的部分特征,确定第二标题交互特征,也即是说,以标题模态为主模态,接收图片模态的辅助信息。可选地,服务器获取图片特征中与标题特征关联的部分特征,并利用跨模交互模型,对该标题特征和该部分特征进行处理,得到该第二标题交互特征。可选地,该跨模交互模型包括但不限于:Transformers模型或者 Transformers模型的变种,例如,该跨模交互模型可以为跨模Transformers模型,将主模态α确定为标题模态,将辅模态β确定为图片模态,模型结构和处理逻辑与上述步骤304类似,这里不做赘述。
在一些实施例中,服务器将该第一标题交互特征和该第二标题交互特征拼接,得到第三标题交互特征,可以降低将该第一标题交互特征和该第二标题交互特征融合时的计算复杂度,可选地,也可采取按元素相加、按元素相乘、双线性汇合等融合方式,本申请实施例对此不进行具体限定。
在一些实施例中,服务器对该第三标题交互特征进行编码和解码,得到该标题交互特征。可选地,服务器将该第三标题交互特征输入Transformers模型,Transformers模型包括N个级联的编码器和N个级联的解码器,调用该N个级联的编码器对该第三标题交互特征进行编码,得到中间标题交互特征,将该中间标题交互特征输入到N个级联的解码器中进行解码,得到该标题交互特征。其中,N为大于或等于1的整数,例如N=6或者其他数值。
在一些实施例中,N个级联的编码器中每个编码器内部包括一个多头注意力层和一个前馈神经网络层,多头注意力层用于从多个表达子空间中综合提取标题数据内各字符之间的关联关系,前馈神经网络层用于对多头注意力层输出的特征向量进行全连接,在多头注意力层和前馈神经网络层之后均设置有残差结构,也即将当前层的输入与输出进行残差连接(即拼接)并归一化之后再输入到下一层中。通过N个级联的编码器对输入的向量进行编码,由最后一个编码器输出该中间标题交互特征。
接着,将该中间标题交互特征输入到N个级联的解码器。N个级联的解码器中每个解码器内部包括一个掩码多头注意力层、一个融合多头注意力层和一个前馈神经网络层,掩码多头注意力层与多头注意力层类似,但掩码多头注意力层仅关注当前时刻之前的翻译结果,因此需要对当前时刻之后的翻译结果进行mask(遮挡)处理,而融合多头注意力层也与多头注意力层也类似,但融合多头注意力层除了以本解码器的掩码多头注意力层的输出为输入之外,还以对应序号的编码器的前馈神经网络层的输出(指经过残差连接及归一化的结果)作为输入,这一设计是用于关注编码器的编码信息,换一种说法,解码器通过查看编码器的输出和对其自身输出的自注意力,来预测下一个时刻的交互特征,解码器的前馈神经网络层与编码器的前馈神经网络层类似,这里不做赘述,同理解码器的掩码多头注意力层、融合多头注意力层、前馈神经网络层之后也均设置有残差结构,也即将当前层的输入与输出进行残差连接(即拼接)并归一化之后再输入到下一层中。其中,级联的编码器的数量与级联的解码器的数量需要保持一致。通过N个级联的解码器可以对该中间标题交互特征进行解码,由最后一个解码器输出最终的标题交互特征。
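706. For the body feature, the server determines a body interaction feature based on the partial features of the title feature and of the picture feature that are respectively associated with the body feature; the body interaction feature represents the body feature after fusion with the title feature and the picture feature.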
由于在标题、正文、图片三模态融合情况下,第一交互特征包括标题交互特征和正文交互特征,因此步骤705-706示出了如何获取第一交互特征的可能实施方式。
在一些实施例中,服务器基于该标题特征中与该正文特征关联的部分特征,确定第一正文交互特征,也即是说,以正文模态为主模态,接收标题模态的辅助信息。可选地,服务器获取标题特征中与正文特征关联的部分特征,并利用跨模交互模型,对该正文特征和该部分特征进行处理,得到该第一正文交互特征。可选地,该跨模交互模型包括但不限于:Transformers模型或者Transformers模型的变种,例如,该跨模交互模型可以为跨模Transformers模型,将主模态α确定为正文模态,将辅模态β确定为标题模态,模型结构和处理逻辑与上述步骤304类似,这里不做赘述。
在一些实施例中,服务器基于该图片特征中与该正文特征关联的部分特征,确定第二正文交互特征,也即是说,以正文模态为主模态,接收图片模态的辅助信息。可选地,服务器获取图片特征中与正文特征关联的部分特征,并利用跨模交互模型,对该正文特征和该部分特征进行处理,得到该第二正文交互特征。可选地,该跨模交互模型包括但不限于:Transformers模型或者Transformers模型的变种,例如,该跨模交互模型可以为跨模Transformers模型,将主模态α确定为正文模态,将辅模态β确定为图片模态,模型结构和处理逻辑与上述步骤304类似,这里不做赘述。
在一些实施例中,服务器将该第一正文交互特征和该第二正文交互特征进行拼接,得到第三正文交互特征,能够降低将该第一正文交互特征和该第二正文交互特征融合时的计算复杂度,可选地,也可采取按元素相加、按元素相乘、双线性汇合等融合方式,本申请实施例对此不进行具体限定。
在一些实施例中,服务器对该第三正文交互特征进行编码和解码,得到该正文交互特征。可选地,服务器将该第三正文交互特征输入Transformers模型,通过Transformers模型中N个级联的编码器对该第三正文交互特征进行编码,得到中间正文交互特征,将该中间正文交互特征输入到N个级联的解码器中进行解码,得到该正文交互特征。其中,N为大于或等于1的整数,例如N=6或者其他数值。Transformers模型的编码器和解码器的内部处理逻辑已在上述步骤705中进行详细说明,这里不做赘述。
707、对该图片特征,服务器基于该标题特征和该正文特征中分别与该图片特征关联的部分特征,确定第二交互特征,该第二交互特征用于表征融合了标题特征和正文特征之后的图片特征。
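In some embodiments, the server determines a first picture interaction feature based on the partial features of the title feature that are associated with the picture feature, that is, the picture modality is the primary modality and receives auxiliary information from the title modality. Optionally, the server obtains the partial features of the title feature that are associated with the picture feature and uses a cross-modal interaction model to process the picture feature and these partial features, obtaining the first picture interaction feature. Optionally, the cross-modal interaction model includes but is not limited to a Transformers model or a variant of the Transformers model. For example, the cross-modal interaction model may be a cross-modal Transformers model in which the primary modality α is the picture modality and the auxiliary modality β is the title modality; the model structure and processing logic are similar to those of step 304 above and are not repeated here.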
在一些实施例中,服务器基于该正文特征中与该图片特征关联的部分特征,确定第二图片交互特征,也即是说,以图片模态为主模态,接收正文模态的辅助信息。可选地,服务器获取正文特征中与图片特征关联的部分特征,并利用跨模交互模型,对该图片特征和该部分特征进行处理,得到该第二图片交互特征。可选地,该跨模交互模型包括但不限于:Transformers模型或者Transformers模型的变种,例如,该跨模交互模型可以为跨模Transformers模型,将主模态α确定为图片模态,将辅模态β确定为正文模态,模型结构和处理逻辑与上述步骤304类似,这里不做赘述。
在一些实施例中,服务器将该第一图片交互特征和该第二图片交互特征进行拼接,得到第三图片交互特征,能够降低将该第一图片交互特征和该第二图片交互特征融合时的计算复杂度,可选地,也可采取按元素相加、按元素相乘、双线性汇合等融合方式,本申请实施例对此不进行具体限定。
在一些实施例中,服务器对该第三图片交互特征进行编码和解码,得到该第二交互特征。可选地,服务器将该第三图片交互特征输入Transformers模型,通过Transformers模型中N个级联的编码器对该第三图片交互特征进行编码,得到中间图片交互特征,将该中间图片交互特征输入到N个级联的解码器中进行解码,得到该第二交互特征。其中,N为大于或等于1的整数,例如N=6或者其他数值。Transformers模型的编码器和解码器的内部处理逻辑已在上述步骤705中进行详细说明,这里不做赘述。
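It should be understood that, in practical applications, the execution order of steps 705, 706, and 707 may be adjusted as needed; this application does not limit the execution order of steps 705, 706, and 707 in any way.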
708、服务器将该标题交互特征、该正文交互特征和该第二交互特征融合,得到跨模态交互特征。
在一些实施例中,服务器将该标题交互特征、该正文交互特征和该第二交互特征进行拼接,得到最终三模态间的跨模态交互特征,从而降低特征融合时的计算量。
在另一些实施例中,服务器可以通过按元素相加、按元素相乘或者双线性汇合等方式,将该标题交互特征、该正文交互特征和该第二交互特征融合,能够使得特征融合得更加充分,本申请实施例不对特征融合方式进行具体限定。
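Steps 705-708 above provide a possible implementation in which the server obtains the cross-modal interaction feature. By dividing the text data into title data and body data, the original two-modality fusion is extended to three-modality fusion, which makes full use of the sequence-level interaction information between modalities. For the three modalities of title, body, and picture, directed cross-modal attention weighting is performed on every pairwise combination (6 combinations in total), and each modality acts as the primary modality that receives auxiliary information from the other two. This greatly improves the expressive power of the resulting cross-modal interaction feature, and in turn greatly improves the accuracy of prediction based on it.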
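The 6 directed combinations mentioned above can be enumerated with a short sketch. The modality names and function names are labels chosen for this illustration; each pair is written as (auxiliary modality, primary modality).

```python
from itertools import permutations

MODALITIES = ("title", "body", "image")


def directed_pairs(modalities):
    """All ordered (auxiliary, primary) pairs of distinct modalities."""
    return [(aux, main) for main, aux in permutations(modalities, 2)]


def group_by_primary(modalities):
    """For every primary modality, list the auxiliary modalities it receives."""
    groups = {main: [] for main in modalities}
    for aux, main in directed_pairs(modalities):
        groups[main].append(aux)
    return groups
```

Here `group_by_primary(MODALITIES)` maps `"title"` to `["body", "image"]`, `"body"` to `["title", "image"]`, and `"image"` to `["title", "body"]`, matching the six cross-modal models and three per-modality fusion branches of steps 705 to 707.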
709、服务器基于该跨模态交互特征,确定该文章所属的文章类别。
上述步骤709与上述步骤205类似,这里不做赘述。
图8是本申请实施例提供的一种多模态融合网络的原理性示意图,如图8所示,多模态融合网络中包括标题编码模型801、正文编码模型802、图片编码模型803和跨模交互部分804。
示意性地,标题编码模型801为由基础BERT模型微调(Finetune)得到的BERT模型,将标题数据的字符序列(简称为标题序列)输入标题编码模型801,输出标题数据的标题语义特征,将标题语义特征输入一个1维卷积层(Conv1D)进行维度变换后,与标题位置特征进行拼接,得到该标题数据的标题特征。
示意性地,正文编码模型802是由微调得到的BERT模型8021和Transformers模型的编码器8022级联而成的,将正文数据的字符序列(简称为正文句子序列)输入BERT模型8021,输出一个语义向量序列,将该语义向量序列再输入到Transformers模型的编码器8022,输出正文数据的正文语义特征,将正文语义特征输入Conv1D层进行维度变换后,与正文位置特征进行拼接,得到该正文数据的正文特征。
示意性地,图片编码模型803为预训练得到的MobileNet模型,将图片数据的图片序列输入图片编码模型803,输出图片数据的图片深度特征,将图片深度特征输入Conv1D层进行维度变换后,与图片位置特征进行拼接,得到该图片数据的图片特征。
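Illustratively, the cross-modal interaction part 804 includes 6 cross-modal Transformers models and 3 basic Transformers models. With the title modality as the primary modality, cross-modal Transformers models are used to extract the first title interaction feature from body modality → title modality and the second title interaction feature from picture modality → title modality; the first and second title interaction features are concatenated to obtain the third title interaction feature, which is input to a Transformers model for encoding followed by decoding to output the title interaction feature. With the body modality as the primary modality, cross-modal Transformers models are used to extract the first body interaction feature from title modality → body modality and the second body interaction feature from picture modality → body modality; the first and second body interaction features are concatenated to obtain the third body interaction feature, which is input to a Transformers model for encoding followed by decoding to output the body interaction feature. With the picture modality as the primary modality, cross-modal Transformers models are used to extract the first picture interaction feature from title modality → picture modality and the second picture interaction feature from body modality → picture modality; the first and second picture interaction features are concatenated to obtain the third picture interaction feature, which is input to a Transformers model for encoding followed by decoding to output the second interaction feature.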
进一步地,将该标题交互特征、该正文交互特征和该第二交互特征进行拼接,得到最终三模态间的跨模态交互特征,再利用该跨模态交互特征,预测出文章最终所属的文章类别(Classification)。需要说明的是,在三模态融合的情况下,也可基于与上述实施例中类似的方式引入相对位置编码方式,这里不做赘述。
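Consider conventional multimodal fusion approaches: because the sampling rates of the modalities differ, the data of the modalities is inherently unaligned, and long-range dependencies exist between elements of different modalities. Conventional multimodal fusion approaches cannot mitigate these two problems, so their identification accuracy for article categories is low. The multimodal fusion network above constructs a cross-modal interaction method for the unaligned setting of high-quality image-text identification. The three-modality cross-modal interaction part makes full use of the sequence-level interaction information between modalities: after the cross-modal interaction model of each of the 6 combinations fuses the information of two modalities, a self-attention-based Transformers model continues modeling with the context, and finally the three groups of features (the title interaction feature, the body interaction feature, and the second interaction feature) are concatenated for prediction. Ablation experiments show that the model performs best in the three-branch setting of title, body, and picture, that is, the interaction information between any two modalities noticeably improves model performance.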
上述结合相对位置编码方式的多模态融合网络可应用于识别优质图文场景中,这一场景下相邻图文间的模态交互性是至关重要的,同时通过引入相对位置编码方式,能够增强对文本和图片序列间相对位置关系的学习,从而提升整体模型的识别准确率。此外,在自媒体时代,影响文章质量评定的因素繁多,除了文本质量,图片和文本之间的整体搭配效果也是至关重要的,上述结合相对位置编码方式的多模态融合网络,完成了图文优质识别场景中多模态模块的构建。
在对内容中心的图文内容进行质量判定的测试任务中,模型评测准确率达到95%,而传统有监督的识别优质图文手段,如仅从文本角度进行内容质量判定时,或者将文本Embedding和图片Embedding进行简单拼接后进行内容质量判定时,其考虑维度都非常单一,并且无法学习到相邻文本和图片间的模态交互信息,结果为整体准确率低于95%,因此,本申请实施例提供的方法能够大大提升针对文章类别的识别准确率。
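In addition, in the above test task, the coverage of high-quality image-text content reached 17%. Through recommendation weighting experiments on the identified high-quality image-text content on the browser side, high-quality content with good text-picture pairing and a good reading experience was recommended to users with priority, and good business results were obtained on the business side relative to earlier application versions. Illustratively, in the content processing pipeline of the content center, a content quality score is computed for all image-text content, which is then released and distributed to the terminal side; the terminal side applies tiered recommendation weighting according to the content quality score, for example, increasing the recommendation weight of identified high-quality content and decreasing the recommendation weight of low-quality content. This recommendation method can effectively improve the reading experience of users and is an innovation in recommendation algorithms based on a concrete business scenario.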
此外,使用本申请实施例所提供的图文先验优质识别算法进行优质内容加权推荐实验后,在浏览器侧整体的点击PV(Page View,页面访问量)提升0.38%,曝光效率提升0.43%,CTR(Click-Through-Rate,点击率)提升0.394%,用户的停留时长提升0.17%;同时DAU(Daily Active User,日活跃用户量)的次日留存提升0.165%,互动指标数据中人均分享提升1.705%,人均点赞提升4.215%,人均评论提升0.188%。
上述所有可选技术方案,能够采用任意结合形成本公开的可选实施例,在此不再一一赘述。
本申请实施例提供的方法,通过针对文章的文本数据和图片数据,分别提取文本特征和图片特征,并利用两者之间的跨模态交互特征,来预测该文章所属的文章类别,该方法同时考虑了文本模态和图片模态各自对于文章类别的贡献程度,而并非仅从文本角度来进行判断,此外所提取的跨模态交互特征并非是文本特征和图片特征的简单拼接,其能够反映出更加丰富和深层次的模态间交互信息,有助于提高对文章类别的识别准确率,进而在识别优质文章的场景下能够提高对优质文章的挖掘准确率。
图9是本申请实施例提供的一种数据处理装置的结构示意图,请参考图9,该装置包括:
第一获取模块901,用于获取文章的文本特征和图片特征,该文本特征用于表征该文章中的文本数据,该图片特征用于表征该文章中的图片数据;
第二获取模块902,用于对该文本特征,基于该图片特征中与该文本特征关联的部分特征,确定第一交互特征,该第一交互特征用于表征融合了图片特征的文本特征;
第三获取模块903,用于对该图片特征,基于该文本特征中与该图片特征关联的部分特征,确定第二交互特征,该第二交互特征用于表征融合了文本特征的图片特征;
融合模块904,用于将该第一交互特征与该第二交互特征融合,得到跨模态交互特征;
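A determining module 905, configured to determine, based on the cross-modal interaction feature, the article category to which the article belongs.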
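The apparatus provided in the embodiments of this application extracts a text feature and a picture feature from the text data and the picture data of an article, respectively, and uses the cross-modal interaction feature between the two to predict the article category to which the article belongs. The apparatus takes into account the contribution of both the text modality and the picture modality to the article category, rather than judging from the text alone. Moreover, the extracted cross-modal interaction feature is not a simple concatenation of the text feature and the picture feature; it reflects richer and deeper interaction information between the modalities, which helps improve the identification accuracy for article categories and, in the scenario of identifying high-quality articles, improves the accuracy of mining high-quality articles.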
在一种可能实施方式中,基于图9的装置组成,该第一获取模块901包括:
第一提取融合单元,用于提取该文本数据的文本语义特征,将该文本语义特征与文本位置特征融合,得到该文本特征;
第二提取融合单元,用于提取该图片数据的图片深度特征,将该图片深度特征与图片位置特征融合,得到该图片特征。
在一种可能实施方式中,该文本数据包括标题数据和正文数据;该文本特征包括标题特征和正文特征;
该第一提取融合单元用于:
提取该标题数据的标题语义特征和该正文数据的正文语义特征;
将该标题语义特征与标题位置特征融合,得到该标题特征;
将该正文语义特征与正文位置特征融合,得到该正文特征。
在一种可能实施方式中,该第一交互特征包括标题交互特征和正文交互特征,基于图9的装置组成,该第二获取模块902包括:
第一获取单元,用于对该标题特征,基于该正文特征和该图片特征中分别与该标题特征关联的部分特征,确定该标题交互特征,该标题交互特征用于表征融合了正文特征和图片特征之后的标题特征;
第二获取单元,用于对该正文特征,基于该标题特征和该图片特征中分别与该正文特征关联的部分特征,确定该正文交互特征,该正文交互特征用于表征融合了标题特征和图片特征之后的正文特征。
在一种可能实施方式中,该第一获取单元用于:
基于该正文特征中与该标题特征关联的部分特征,确定第一标题交互特征;
基于该图片特征中与该标题特征关联的部分特征,确定第二标题交互特征;
将该第一标题交互特征和该第二标题交互特征拼接,得到第三标题交互特征;
对该第三标题交互特征进行编码和解码,得到该标题交互特征。
在一种可能实施方式中,该第二获取单元用于:
基于该标题特征中与该正文特征关联的部分特征,确定第一正文交互特征;
基于该图片特征中与该正文特征关联的部分特征,确定第二正文交互特征;
将该第一正文交互特征和该第二正文交互特征拼接,得到第三正文交互特征;
对该第三正文交互特征进行编码和解码,得到该正文交互特征。
在一种可能实施方式中,基于图9的装置组成,该第三获取模块903包括:
第三获取单元,用于对该图片特征,基于该标题特征和该正文特征中分别与该图片特征关联的部分特征,确定该第二交互特征。
在一种可能实施方式中,该第三获取单元用于:
基于该标题特征中与该图片特征关联的部分特征,确定第一图片交互特征;
基于该正文特征中与该图片特征关联的部分特征,确定第二图片交互特征;
将该第一图片交互特征和该第二图片交互特征拼接,得到第三图片交互特征;
对该第三图片交互特征进行编码和解码,得到该第二交互特征。
在一种可能实施方式中,该文本位置特征和该图片位置特征均为该文本数据与该图片数据之间的相对位置特征,该相对位置特征用于表征该文本数据与该图片数据之间的先后顺序和距离远近。
在一种可能实施方式中,该相对位置特征的确定方式包括:
确定该文本数据中的多个文本、以及该图片数据中的多个图片各自在该文章中的位置信息;
基于该位置信息,构建相对位置编码矩阵,该相对位置编码矩阵中的任一元素用于表征该元素所属列对应的文本和该元素所属行对应的图片之间的相对位置信息;
基于该相对位置编码矩阵,确定该多个文本中的任一文本与该多个图片中的任一图片之间的相对位置特征。
在一种可能实施方式中,该确定模块905用于:
对该跨模态交互特征进行全连接处理,得到全连接特征;
对该全连接特征进行指数归一化,得到该文章的概率预测结果;该概率预测结果包括多个预测概率,多个预测概率与多个类别一一对应;
确定符合目标条件的预测概率对应的类别,为该文章所属的文章类别。
上述所有可选技术方案,能够采用任意结合形成本公开的可选实施例,在此不再一一赘述。
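It should be noted that when the data processing apparatus provided in the above embodiments processes data, the division into the functional modules above is only an example. In practical applications, the functions can be assigned to different functional modules as needed, that is, the internal structure of the computer device is divided into different functional modules to complete all or part of the functions described above. In addition, the data processing apparatus provided in the above embodiments and the data processing method embodiments belong to the same concept; for the specific implementation process, see the data processing method embodiments, which are not repeated here.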
图10是本申请实施例提供的一种计算机设备的结构示意图,请参考图10,以计算机设备为终端1000为例进行说明,此时终端1000能够独立完成对文章的文章类别的识别过程。可选地,该终端1000的设备类型包括:智能手机、平板电脑、MP3播放器(Moving Picture Experts Group Audio Layer III,动态影像专家压缩标准音频层面3)、MP4(Moving Picture Experts Group Audio Layer IV,动态影像专家压缩标准音频层面4)播放器、笔记本电脑或台式电脑。终端1000还可能被称为用户设备、便携式终端、膝上型终端、台式终端等其他名称。
通常,终端1000包括有:处理器1001和存储器1002。
可选地,处理器1001包括一个或多个处理核心,比如4核心处理器、8核心处理器等。可选地,处理器1001采用DSP(Digital Signal Processing,数字信号处理)、FPGA(Field-Programmable Gate Array,现场可编程门阵列)、PLA(Programmable Logic Array,可编程逻辑阵列)中的至少一种硬件形式来实现。在一些实施例中,处理器1001包括主处理器和协处理器,主处理器是用于对在唤醒状态下的数据进行处理的处理器,也称CPU(Central Processing Unit,中央处理器);协处理器是用于对在待机状态下的数据进行处理的低功耗处理器。在一些实施例中,处理器1001集成有GPU(Graphics Processing Unit,图像处理器),GPU用于负责显示屏所需要显示的内容的渲染和绘制。一些实施例中,处理器1001还包括AI(Artificial Intelligence,人工智能)处理器,该AI处理器用于处理有关机器学习的计算操作。
在一些实施例中,存储器1002包括一个或多个计算机可读存储介质,可选地,该计算机可读存储介质是非暂态的。可选地,存储器1002还包括高速随机存取存储器,以及非易失性存储器,比如一个或多个磁盘存储设备、闪存存储设备。在一些实施例中,存储器1002中的非暂态的计算机可读存储介质用于存储至少一个程序代码,该至少一个程序代码用于被处理器1001所执行以实现本申请中各个实施例提供的数据处理方法。
在一些实施例中,终端1000还可选包括有:外围设备接口1003和至少一个外围设备。处理器1001、存储器1002和外围设备接口1003之间能够通过总线或信号线相连。各个外围设备能够通过总线、信号线或电路板与外围设备接口1003相连。具体地,外围设备包括:射频电路1004、显示屏1005、摄像头组件1006、音频电路1007、定位组件1008和电源1009中的至少一种。
在一些实施例中,终端1000还包括有一个或多个传感器1010。该一个或多个传感器1010包括但不限于:加速度传感器1011、陀螺仪传感器1012、压力传感器1013、指纹传感器1014、光学传感器1015以及接近传感器1016。
本领域技术人员能够理解,图10中示出的结构并不构成对终端1000的限定,能够包括比图示更多或更少的组件,或者组合某些组件,或者采用不同的组件布置。
图11是本申请实施例提供的一种计算机设备的结构示意图,该计算机设备1100可因配置或性能不同而产生比较大的差异,该计算机设备1100包括一个或一个以上处理器(Central Processing Units,CPU)1101和一个或一个以上的存储器1102,其中,该存储器1102中存储有至少一条计算机程序,该至少一条计算机程序由该一个或一个以上处理器1101加载并执行以实现上述各个实施例提供的数据处理方法。可选地,该计算机设备1100还具有有线或无线网络接口、键盘以及输入输出接口等部件,以便进行输入输出,该计算机设备1100还包括其他用于实现设备功能的部件,在此不做赘述。
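In an exemplary embodiment, a computer-readable storage medium is further provided, for example, a memory including at least one computer program, where the at least one computer program can be executed by a processor in a terminal to complete the data processing method in the foregoing embodiments. For example, the computer-readable storage medium includes a ROM (Read-Only Memory), a RAM (Random-Access Memory), a CD-ROM (Compact Disc Read-Only Memory), a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, a computer program product or computer program is further provided, including one or more pieces of program code stored in a computer-readable storage medium. One or more processors of a computer device can read the one or more pieces of program code from the computer-readable storage medium and execute them, so that the computer device can perform the data processing method in the foregoing embodiments.
A person of ordinary skill in the art can understand that all or some of the steps of the foregoing embodiments can be completed by hardware, or by a program instructing related hardware. Optionally, the program is stored in a computer-readable storage medium; optionally, the storage medium mentioned above is a read-only memory, a magnetic disk, an optical disc, or the like.
The foregoing descriptions are merely optional embodiments of this application and are not intended to limit this application. Any modification, equivalent replacement, or improvement made within the spirit and principle of this application shall fall within the protection scope of this application.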

Claims (15)

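  1. A data processing method, performed by a computer device, the method comprising:
    obtaining a text feature and a picture feature of an article, the text feature representing text data in the article, and the picture feature representing picture data in the article;
    for the text feature, determining a first interaction feature based on a partial feature that is in the picture feature and that is associated with the text feature, the first interaction feature representing the text feature fused with the picture feature;
    for the picture feature, determining a second interaction feature based on a partial feature that is in the text feature and that is associated with the picture feature, the second interaction feature representing the picture feature fused with the text feature;
    fusing the first interaction feature and the second interaction feature to obtain a cross-modal interaction feature; and
    determining, based on the cross-modal interaction feature, an article category to which the article belongs.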
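  2. The method according to claim 1, wherein the obtaining a text feature and a picture feature of an article comprises:
    extracting a text semantic feature of the text data, and fusing the text semantic feature with a text position feature to obtain the text feature; and
    extracting a picture depth feature of the picture data, and fusing the picture depth feature with a picture position feature to obtain the picture feature.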
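  3. The method according to claim 2, wherein the text data comprises title data and body data, and the text feature comprises a title feature and a body feature;
    the extracting a text semantic feature of the text data comprises:
    extracting a title semantic feature of the title data and a body semantic feature of the body data; and
    the fusing the text semantic feature with a text position feature to obtain the text feature comprises:
    fusing the title semantic feature with a title position feature to obtain the title feature; and
    fusing the body semantic feature with a body position feature to obtain the body feature.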
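  4. The method according to claim 3, wherein the first interaction feature comprises a title interaction feature and a body interaction feature, and the determining, for the text feature, a first interaction feature based on a partial feature that is in the picture feature and that is associated with the text feature comprises:
    for the title feature, determining the title interaction feature based on partial features that are in the body feature and the picture feature respectively and that are associated with the title feature, the title interaction feature representing the title feature fused with the body feature and the picture feature; and
    for the body feature, determining the body interaction feature based on partial features that are in the title feature and the picture feature respectively and that are associated with the body feature, the body interaction feature representing the body feature fused with the title feature and the picture feature.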
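  5. The method according to claim 4, wherein the determining the title interaction feature based on partial features that are in the body feature and the picture feature respectively and that are associated with the title feature comprises:
    determining a first title interaction feature based on a partial feature that is in the body feature and that is associated with the title feature;
    determining a second title interaction feature based on a partial feature that is in the picture feature and that is associated with the title feature;
    concatenating the first title interaction feature and the second title interaction feature to obtain a third title interaction feature; and
    encoding and decoding the third title interaction feature to obtain the title interaction feature.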
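  6. The method according to claim 4, wherein the determining the body interaction feature based on partial features that are in the title feature and the picture feature respectively and that are associated with the body feature comprises:
    determining a first body interaction feature based on a partial feature that is in the title feature and that is associated with the body feature;
    determining a second body interaction feature based on a partial feature that is in the picture feature and that is associated with the body feature;
    concatenating the first body interaction feature and the second body interaction feature to obtain a third body interaction feature; and
    encoding and decoding the third body interaction feature to obtain the body interaction feature.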
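  7. The method according to claim 3, wherein the determining, for the picture feature, a second interaction feature based on a partial feature that is in the text feature and that is associated with the picture feature comprises:
    for the picture feature, determining the second interaction feature based on partial features that are in the title feature and the body feature respectively and that are associated with the picture feature.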
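  8. The method according to claim 7, wherein the determining the second interaction feature based on partial features that are in the title feature and the body feature respectively and that are associated with the picture feature comprises:
    determining a first picture interaction feature based on a partial feature that is in the title feature and that is associated with the picture feature;
    determining a second picture interaction feature based on a partial feature that is in the body feature and that is associated with the picture feature;
    concatenating the first picture interaction feature and the second picture interaction feature to obtain a third picture interaction feature; and
    encoding and decoding the third picture interaction feature to obtain the second interaction feature.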
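  9. The method according to claim 2, wherein the text position feature and the picture position feature are both relative position features between the text data and the picture data, the relative position features representing the order of, and the distance between, the text data and the picture data.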
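  10. The method according to claim 9, wherein the relative position features are determined by:
    determining position information, in the article, of each of a plurality of texts in the text data and each of a plurality of pictures in the picture data;
    constructing a relative position encoding matrix based on the position information, any element in the relative position encoding matrix representing relative position information between the text corresponding to the column of the element and the picture corresponding to the row of the element; and
    determining, based on the relative position encoding matrix, a relative position feature between any one of the plurality of texts and any one of the plurality of pictures.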
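  11. The method according to claim 1, wherein the determining, based on the cross-modal interaction feature, an article category to which the article belongs comprises:
    performing fully connected processing on the cross-modal interaction feature to obtain a fully connected feature;
    performing exponential normalization on the fully connected feature to obtain a probability prediction result of the article, the probability prediction result comprising a plurality of prediction probabilities in one-to-one correspondence with a plurality of categories; and
    determining the category corresponding to a prediction probability that meets a target condition as the article category to which the article belongs.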
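  12. A data processing apparatus, comprising:
    a first obtaining module, configured to obtain a text feature and a picture feature of an article, the text feature representing text data in the article, and the picture feature representing picture data in the article;
    a second obtaining module, configured to determine, for the text feature, a first interaction feature based on a partial feature that is in the picture feature and that is associated with the text feature, the first interaction feature representing the text feature fused with the picture feature;
    a third obtaining module, configured to determine, for the picture feature, a second interaction feature based on a partial feature that is in the text feature and that is associated with the picture feature, the second interaction feature representing the picture feature fused with the text feature;
    a fusion module, configured to fuse the first interaction feature and the second interaction feature to obtain a cross-modal interaction feature; and
    a determining module, configured to determine, based on the cross-modal interaction feature, an article category to which the article belongs.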
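  13. A computer device, comprising one or more processors and one or more memories, the one or more memories storing at least one computer program, the at least one computer program being loaded and executed by the one or more processors to implement the data processing method according to any one of claims 1 to 11.
  14. A storage medium, storing at least one computer program, the at least one computer program being loaded and executed by a processor to implement the data processing method according to any one of claims 1 to 11.
  15. A computer program product, comprising at least one computer program, the at least one computer program being loaded and executed by a processor to implement the data processing method according to any one of claims 1 to 11.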
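
Illustrative note on claims 1 and 11 (not part of the claims): the flow of determining two interaction features, fusing them, and classifying the article can be sketched roughly as follows. The claims do not fix a particular operator, so this sketch assumes scaled dot-product cross-attention with a residual connection for the interaction step, mean pooling plus concatenation for the fusion step, and "highest probability" as the target condition. All function and variable names (`cross_attend`, `classify_article`, `weight`, `bias`) are hypothetical, and the feature values are random placeholders.

```python
import math
import random


def softmax(values):
    """Exponential normalization: turn scores into probabilities that sum to 1."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_attend(queries, contexts):
    """For each query vector, add an attention-weighted sum of the context vectors.

    The attention weights pick out the part of `contexts` most associated
    with each query (assumed here to be scaled dot-product attention).
    """
    scale = math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        weights = softmax([dot(q, c) / scale for c in contexts])
        attended = [sum(w * c[i] for w, c in zip(weights, contexts))
                    for i in range(len(q))]
        out.append([qi + ai for qi, ai in zip(q, attended)])  # residual add
    return out


def mean_pool(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]


def classify_article(text_feats, picture_feats, weight, bias):
    first = cross_attend(text_feats, picture_feats)    # text fused with picture
    second = cross_attend(picture_feats, text_feats)   # picture fused with text
    fused = mean_pool(first) + mean_pool(second)       # concatenated cross-modal feature
    logits = [dot(fused, w_row) + b for w_row, b in zip(weight, bias)]  # fully connected
    probs = softmax(logits)
    category = max(range(len(probs)), key=probs.__getitem__)  # assumed target condition
    return category, probs


random.seed(0)
dim, num_classes = 8, 4
text_feats = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(6)]
picture_feats = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(3)]
weight = [[random.gauss(0, 1) for _ in range(2 * dim)] for _ in range(num_classes)]
bias = [0.0] * num_classes
category, probs = classify_article(text_feats, picture_feats, weight, bias)
print(category, [round(p, 3) for p in probs])
```

In a trained system the weights would be learned and the features would come from text and image encoders; the random values here only exercise the data flow.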
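
Illustrative note on claims 9 and 10 (not part of the claims): one simple way to build a relative position encoding matrix is to take the signed offset between each text's index and each picture's index in the article, with rows for pictures and columns for texts. The sign then carries the order and the magnitude carries the distance. This is only a sketch under those assumptions; the function names and the example positions are hypothetical, and the source does not state how the matrix entries are actually encoded.

```python
def relative_position_matrix(text_positions, picture_positions):
    """Rows correspond to pictures, columns to texts.

    Each entry is (text position - picture position): negative means the
    text comes before the picture, positive means it comes after.
    """
    return [[t - p for t in text_positions] for p in picture_positions]


def relative_position_feature(matrix, picture_idx, text_idx):
    """Return (order, distance) for one picture/text pair."""
    offset = matrix[picture_idx][text_idx]
    order = (offset > 0) - (offset < 0)  # +1: text after picture, -1: before
    return order, abs(offset)


# Hypothetical article layout: text, text, picture, text, picture
text_positions = [0, 1, 3]
picture_positions = [2, 4]
matrix = relative_position_matrix(text_positions, picture_positions)
print(matrix)
print(relative_position_feature(matrix, 0, 2))
```

A learned model would typically map each offset to an embedding vector before fusing it with the semantic or depth features, but that mapping is left out here.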
PCT/CN2022/111609 2021-09-22 2022-08-11 Data processing method and apparatus, computer device, and storage medium Ceased WO2023045605A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP22871663.5A EP4310695A4 (en) 2021-09-22 2022-08-11 DATA PROCESSING METHOD AND APPARATUS, COMPUTER DEVICE AND STORAGE MEDIUM
US18/232,098 US20230386238A1 (en) 2021-09-22 2023-08-09 Data processing method and apparatus, computer device, and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111106186.5A CN115858826B (zh) 2021-09-22 2021-09-22 Data processing method and apparatus, computer device, and storage medium
CN202111106186.5 2021-09-22

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/232,098 Continuation US20230386238A1 (en) 2021-09-22 2023-08-09 Data processing method and apparatus, computer device, and storage medium

Publications (2)

Publication Number Publication Date
WO2023045605A1 true WO2023045605A1 (zh) 2023-03-30
WO2023045605A9 WO2023045605A9 (zh) 2024-09-12

Family

ID=85652134

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/111609 Ceased WO2023045605A1 (zh) 2021-09-22 2022-08-11 Data processing method and apparatus, computer device, and storage medium

Country Status (4)

Country Link
US (1) US20230386238A1 (zh)
EP (1) EP4310695A4 (zh)
CN (1) CN115858826B (zh)
WO (1) WO2023045605A1 (zh)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116189193A (zh) * 2023-04-25 2023-05-30 杭州镭湖科技有限公司 一种基于样本信息的数据存储可视化方法和装置
CN116716079A (zh) * 2023-06-14 2023-09-08 山东沃赛新材料科技有限公司 高性能防霉型醇型美容收边胶及其制备方法
CN117173483A (zh) * 2023-09-15 2023-12-05 科大讯飞股份有限公司 物体识别方法、装置、设备及存储介质
CN117611245A (zh) * 2023-12-14 2024-02-27 浙江博观瑞思科技有限公司 用于电商运营活动策划的数据分析管理系统及方法
CN117708507A (zh) * 2024-02-05 2024-03-15 成都麦特斯科技有限公司 一种基于人工智能的高效α和β射线的识别与分类方法
CN119784679A (zh) * 2024-11-26 2025-04-08 深圳创景数科信息技术有限公司 基于多模态模型的纺织面料识别方法、装置、设备及存储介质

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117591664B (zh) * 2023-04-25 2024-12-13 上海任意门科技有限公司 一种数据挖掘方法、装置及存储介质
CN116702094B (zh) * 2023-08-01 2023-12-22 国家计算机网络与信息安全管理中心 一种群体应用偏好特征表示方法
CN118155037B (zh) * 2024-05-09 2024-07-30 汕头大学医学院 一种基于注意力机制的多模态特征融合方法及系统
US12554491B2 (en) * 2024-05-13 2026-02-17 Microsoft Technology Licensing, Llc Vector processor tile array with input and output streams
CN118887229A (zh) * 2024-05-31 2024-11-01 中山大学肿瘤防治中心(中山大学附属肿瘤医院、中山大学肿瘤研究所) 基于多模态肿瘤图像的图像分割方法、装置、终端设备及存储介质
CN119379787B (zh) * 2024-09-26 2025-11-18 中国科学院自动化研究所 视觉特征与文本特征融合的位置编码方法、系统及装置
CN118965279B (zh) * 2024-10-15 2024-12-13 北京网智天元大数据科技有限公司 一种基于大模型的金融内容风控方法及系统

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109816039A (zh) * 2019-01-31 2019-05-28 深圳市商汤科技有限公司 一种跨模态信息检索方法、装置和存储介质
CN111985369A (zh) * 2020-08-07 2020-11-24 西北工业大学 基于跨模态注意力卷积神经网络的课程领域多模态文档分类方法
US20210081729A1 (en) * 2019-09-16 2021-03-18 Beijing Baidu Netcom Science Technology Co., Ltd. Method for image text recognition, apparatus, device and storage medium
CN112559683A (zh) * 2020-12-11 2021-03-26 苏州元启创人工智能科技有限公司 基于多模态数据及多交互记忆网络的方面级情感分析方法
CN112784092A (zh) * 2021-01-28 2021-05-11 电子科技大学 一种混合融合模型的跨模态图像文本检索方法

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
SG155922A1 (en) * 2006-04-05 2009-10-29 Agency Science Tech & Res Apparatus and method for analysing a video broadcast
JP2013235507A (ja) * 2012-05-10 2013-11-21 Mynd Inc 情報処理方法、装置、コンピュータプログラムならびに記録媒体
CN110795657B (zh) * 2019-09-25 2023-10-27 腾讯科技(深圳)有限公司 文章推送及模型训练方法、装置、存储介质和计算机设备
CN112231497B (zh) * 2020-10-19 2024-04-09 腾讯科技(深圳)有限公司 信息分类方法、装置、存储介质及电子设备
CN112989097A (zh) * 2021-03-23 2021-06-18 北京百度网讯科技有限公司 模型训练、图片检索方法及装置
CN113297485B (zh) * 2021-05-24 2023-01-24 中国科学院计算技术研究所 一种生成跨模态的表示向量的方法以及跨模态推荐方法
CN116434000A (zh) * 2023-02-17 2023-07-14 京东科技控股股份有限公司 模型训练及物品分类方法、装置、存储介质及电子设备

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4310695A4

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116189193A (zh) * 2023-04-25 2023-05-30 杭州镭湖科技有限公司 一种基于样本信息的数据存储可视化方法和装置
CN116189193B (zh) * 2023-04-25 2023-11-10 杭州镭湖科技有限公司 一种基于样本信息的数据存储可视化方法和装置
CN116716079A (zh) * 2023-06-14 2023-09-08 山东沃赛新材料科技有限公司 高性能防霉型醇型美容收边胶及其制备方法
CN116716079B (zh) * 2023-06-14 2024-01-19 山东沃赛新材料科技有限公司 高性能防霉型醇型美容收边胶及其制备方法
CN117173483A (zh) * 2023-09-15 2023-12-05 科大讯飞股份有限公司 物体识别方法、装置、设备及存储介质
CN117611245A (zh) * 2023-12-14 2024-02-27 浙江博观瑞思科技有限公司 用于电商运营活动策划的数据分析管理系统及方法
CN117611245B (zh) * 2023-12-14 2024-05-31 浙江博观瑞思科技有限公司 用于电商运营活动策划的数据分析管理系统及方法
CN117708507A (zh) * 2024-02-05 2024-03-15 成都麦特斯科技有限公司 一种基于人工智能的高效α和β射线的识别与分类方法
CN117708507B (zh) * 2024-02-05 2024-04-26 成都麦特斯科技有限公司 一种基于人工智能的高效α和β射线的识别与分类方法
CN119784679A (zh) * 2024-11-26 2025-04-08 深圳创景数科信息技术有限公司 基于多模态模型的纺织面料识别方法、装置、设备及存储介质

Also Published As

Publication number Publication date
CN115858826A (zh) 2023-03-28
WO2023045605A9 (zh) 2024-09-12
CN115858826B (zh) 2025-10-03
EP4310695A4 (en) 2024-10-16
US20230386238A1 (en) 2023-11-30
EP4310695A1 (en) 2024-01-24

Similar Documents

Publication Publication Date Title
WO2023045605A1 (zh) Data processing method and apparatus, computer device, and storage medium
CN113627447B (zh) 标签识别方法、装置、计算机设备、存储介质及程序产品
US9807473B2 (en) Jointly modeling embedding and translation to bridge video and language
CN110717017A (zh) 一种处理语料的方法
CN114357973A (zh) 意图识别方法、装置、电子设备及存储介质
Xiao et al. User preference mining based on fine-grained sentiment analysis
WO2021204017A1 (zh) 文本意图识别方法、装置以及相关设备
WO2020244475A1 (zh) 用于语言序列标注的方法、装置、存储介质及计算设备
CN112101042B (zh) 文本情绪识别方法、装置、终端设备和存储介质
CN114281996B (zh) 长文本分类方法、装置、设备及存储介质
CN113704466B (zh) 基于迭代网络的文本多标签分类方法、装置及电子设备
CN110275963A (zh) 用于输出信息的方法和装置
CN115588122A (zh) 一种基于多模态特征融合的新闻分类方法
CN112487827A (zh) 问题回答方法及电子设备、存储装置
CN116340502A (zh) 基于语义理解的信息检索方法和装置
CN113822065B (zh) 关键词召回方法、装置、电子设备以及存储介质
Ostendorf Continuous-space language processing: Beyond word embeddings
CN109902155B (zh) 多模态对话状态处理方法、装置、介质及计算设备
CN119621954A (zh) 一种基于ai识别用户意图的语义搜索方法和存储介质
CN120409657B (zh) 多模态大模型驱动的人物知识图谱构建方法及系统
CN115391542A (zh) 分类模型的训练方法、文本分类方法、装置及设备
CN112732913B (zh) 一种非均衡样本的分类方法、装置、设备及存储介质
CN115062136A (zh) 基于图神经网络的事件消歧方法及其相关设备
CN114722832A (zh) 一种摘要提取方法、装置、设备以及存储介质
CN114510942A (zh) 获取实体词的方法、模型的训练方法、装置及设备

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 22871663; Country of ref document: EP; Kind code of ref document: A1)
WWE Wipo information: entry into national phase (Ref document number: 2022871663; Country of ref document: EP)
ENP Entry into the national phase (Ref document number: 2022871663; Country of ref document: EP; Effective date: 20231016)
NENP Non-entry into the national phase (Ref country code: DE)